Detailed Guide
rtweet library
This section covers the functions in the rtweet library that I have learned to use. I provide understandable code and examples of how to use each function.
Scraping Twitter for data is a superpower; it is wise to wield it with caution in the name of data science.
Twitter API Rate Limits
Returns rate-limit data for all Twitter API function calls.
It is important to know your rate limits when calling functions so that you can avoid being rate limited.
The returned tibble includes a timestamp of when you last ran a specific function call and shows when you are safe to call it again.
#--- returns a full list of functions and their rate limits
Rate_Limit = rate_limit()
#--- get rate limit info for specific token (function)
token <- get_tokens()
rate_limit(token)
rate_limit(token, "search_tweets")
Search
Search Tweets by User
Returns up to 90,000 statuses (tweets).
- grab tweets by status_id or screen_name
- tweets must be <= 90,000 per request
- you must avoid rate limits when grabbing more than 90,000 tweets by iterating every 15 min
- use next_cursor() to wait and scrape tweets every 15 min (p. 46 in the rtweet docs PDF)
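As a sketch of the iteration advice above, search_tweets can also handle the waiting for you via its retryonratelimit argument (the query and count here are placeholders):

```r
library(rtweet)

# a minimal sketch: ask for more tweets than one 15-minute window
# allows and let rtweet sleep until the rate limit resets
big_pull <- search_tweets(
  "#rstats",
  n = 100000,              # more than a single window permits
  retryonratelimit = TRUE  # wait out each 15-minute window, then continue
)
```

retryonratelimit makes long pulls slow but hands-off; for manual control you can instead check rate_limit() and Sys.sleep() between pages.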
Note: It is important to hold onto these status_id numbers, as they are one sure way to retrieve old tweets without involving the premium Twitter API, which is otherwise required for older tweets.
This code from the documentation shows how a status_id can fetch old tweets:
statuses <- c(
"567053242429734913", # Andrew Malcolm 2015-02-15
"266031293945503744", # Barack Obama 2012-11-07
"440322224407314432" # Ellen DeGeneres 2014-03-03
)
tweet_statuses = lookup_statuses(statuses)
tweet_statuses %>%
select(status_id, name, screen_name, user_id, created_at, text)
Search Multiple Users
Using a vector to grab tweets.
twitter_names = c("usr1", "usr2", "usr3", "usr4")
twitter_users_search = lookup_users(
users = twitter_names,
parse = TRUE
)
Search Multiple Queries
There are two methods for retrieving multiple Twitter search queries, but I will show you only the easiest one.
Note: there are both search_tweets and search_tweets2; the latter is more flexible.
This is how to search for three queries using a vector, capturing up to 1,000 tweets per query.
dataSci_tweets = search_tweets2(
c("data science","RStats","dataviz"),
n = 1000
)
#-- look at the dataframe
head(dataSci_tweets)
#--- look at each query tweet tally of the 3 queries
table(dataSci_tweets$query)
# Tally the 3 queries
## data science dataviz RStats
## 999 1000 1000
Get User’s Friends
The Twitter API calls an account that a user follows a friend (not to be confused with a follower). Here is how to get the list of a user’s friends, i.e. the accounts they follow.
- 5,000 is the rate-limit maximum (and the default)
- page = "-1" requests the first page of results; if a user follows more than 5,000 accounts, use the returned next_cursor value to page through the rest
usr_friends = get_friends(
'<user_name>', # @<user_name>
n = 5000,
page = "-1",
parse= TRUE
)
This returns a dataframe with integer values for user_id.
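A small follow-up sketch (assuming the usr_friends dataframe from the call above): the numeric IDs can be resolved to full profiles with lookup_users().

```r
# resolve the returned user_id values to full user profiles
friend_profiles <- lookup_users(usr_friends$user_id)

friend_profiles %>%
  select(screen_name, name, followers_count)
```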
Get User’s Likes
The Twitter API calls a user’s likes favorites, the {❤️}. The rate limit is <= 3,000 statuses (tweets).
usr_likes = get_favorites(
'<user_name>', # @<user_name>
n = 200,
parse = TRUE
)
Get Likes for a List of Users
Use a vector of multiple Twitter accounts to get each account’s liked tweets.
users = c('usr1','usr2','usr3','usr4','usr5')
usr_faves = get_favorites( users, n = 400)
usr_faves %>% view()
usr_faves %>%
select(screen_name, text, favorite_count) %>% view()
List Search Tweets
Grab tweets for more than one Twitter user with lapply().
- Twitter users: c('usr1', 'usr2', 'usr3', 'usr4')
list_tweets = lapply(c("usr1",
"usr2",
"usr3",
"usr4"),
search_tweets, # function applied to each user name
n = 5000 # number of tweets
)
Note: list_tweets %>% view() doesn’t work on this list; you need to call do_call_rbind() to bind the list into a dataframe.
tweet_df = do_call_rbind( list_tweets )
#-- now we have a dataframe
tweet_df %>% view()
#-- view the users data
users_data( tweet_df ) %>% view()
Get Twitter Mentions of User
This returns up to 200 mentions of a Twitter user, i.e. the last 200 tweets in which you were tagged in a reply.
usr_mentions = get_mentions(
n = 200,
parse = TRUE
)
usr_mentions$text
Your Twitter Timeline
Returns your timeline, the ‘home’ Twitter tab if you were in the app.
The default number of timeline tweets is 100; the check argument is for the rate limit.
my_twitter_ = get_my_timeline(
n = 100,
parse = TRUE,
check = TRUE
)
my_twitter_$screen_name
Get a User’s Timeline
Returns up to 3,200 statuses (tweets) of a single Twitter user.
The home argument is FALSE for the user timeline and TRUE for the home timeline.
user_timeline = get_timeline(
'<user_name>',
n = 100,
home = FALSE,
parse = TRUE,
check = TRUE
)
user_timeline %>%
select(screen_name, text)
Get Users’ Timelines
Returns up to 3,200 statuses (tweets) for each Twitter user specified.
- users: c('usr1', 'usr2', 'usr3')
group_timelines = get_timelines(
c('usr1','usr2', 'usr3'),
n = 100,
home = FALSE,
parse = TRUE,
check = TRUE
)
group_timelines %>%
select(screen_name, text) %>% view()
group_timelines$text
Grab Direct Messages
Retrieve up to 50 of your direct messages from the last 30 days.
direct_messages(n=50,
next_cursor = NULL,
parse = TRUE,
token = NULL)
Get Twitter Retweeters
Returns the IDs of users who retweeted a status. The maximum per request is 100.
The status_id is required; it is the long integer associated with a tweet (status).
retweeters = get_retweeters(
'<status_id>', # a 19-digit integer
n = 100,
parse = TRUE
)
retweeters
Get Retweets
Returns a collection of up to 100 recent retweets of a specific tweet (status). The maximum per request is 100.
- One way to find a status_id: go to your own Twitter timeline or notifications and find a tweet that was retweeted, then click on one of the accounts that retweeted it. Scroll their timeline to the retweet of your tweet and click on it; the integer value is in the URL address bar.
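A sketch of that last step: given a tweet URL copied from the address bar (the URL and handle here are made up), base R’s sub() can pull out the status_id.

```r
# hypothetical tweet URL copied from the browser address bar
tweet_url <- "https://twitter.com/some_user/status/1234567890123456789"

# extract the integer that follows /status/
status_id <- sub(".*/status/(\\d+).*", "\\1", tweet_url)
status_id
#> [1] "1234567890123456789"
```

Keeping the ID as a character string (rather than a numeric) avoids precision loss on these long integers.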
re_tweets = get_retweets(
'<long integer number>',
n = 100,
parse = TRUE
)
re_tweets %>%
select(screen_name, text) %>% view()
Get Trends on Twitter
You want to stay up with the trends of today, this hour, this minute; you can get the latest trends by using get_trends(), so trendy! It returns Twitter trends for a specific location.
- ‘city-name’ or ‘country-name’ can be used
- a Where On Earth ID (WOEID) can be used, e.g. Toronto: 4118
- lat = and lng = values can be used
Here are some coordinates to start with:
- Vancouver, BC, Canada
49.2827, -123.1207
- Halifax, NS, Canada
44.651070, -63.582687
- Yellowknife, NT, Canada
62.453972, -114.371788
- Edmonton, AB, Canada
53.631611, -113.323975
This code shows how to get the trending tweets for Canada and Vancouver, whether the trends are promoted (ads), and their tweet volume.
trending = get_trends('canada')
trending %>%
select(trend, place, promoted_content, tweet_volume) %>%
arrange( desc(tweet_volume) )
get_city_trend = get_trends(lat = 49.28, # Vancouver
lng = -123.12)
get_city_trend
Twitter List Members
Returns the users on a given list, or list memberships for a given user.
- a slug is the name associated with a list; it is an alternative to list_id in the function call
- list_id is a numeric value
- owner_user is the account that created/owns the list
- the query maximum is 4,000 and is the default
This example uses the rstats list created by Twitter account “@owenlhjphillips”. You can use slug instead of list_id, together with owner_user.
membersList = lists_members(
list_id = "1................8", # option 1 (no slug)
# slug = 'rstats', # option 2 (no list_id)
owner_user = "<usr_name>",
parse = TRUE,
n= 4000
)
membersList %>%
select(name,
screen_name,
location,
followers_count) %>%
view()
This will return a large list of Twitter users who are members of a List.
Twitter User’s List Memberships
Returns the lists a Twitter user is a member of.
usr_memberships = lists_memberships(
user = "<user_name>",
n = 200,
parse = TRUE
)
usr_memberships %>%
select(name, full_name) %>% view()
Timeline Tweets by User of a List
Returns a timeline of tweets of a list.
- include_rts (optional) takes TRUE or FALSE for including retweets
- parse is set to TRUE by default
- since_id (optional) returns tweets more recent than the specified status ID, subject to rate limits
- max_id (optional) returns tweets older than or equal to the specified ID
- a slug is the name associated with a list; it can be used in the function call instead of list_id
usr_list_timeline = lists_statuses(
slug = "<slug_name>",
owner_user = "<usr_name>",
n = 200,
parse = TRUE,
include_rts = FALSE
)
usr_list_timeline %>% view()
Twitter List Subscribers
A Twitter list has subscribers; this function returns who subscribes to the specified list.
This example uses the New York Times politics list subscribers.
NYT_subs = lists_subscribers(
slug = "new-york-times-politics",
owner_user = "nytpolitics",
n= 1000
)
NYT_subs %>% head()
Twitter List Subscriptions by User
Returns the lists a user subscribes to; it answers the question “what does Twitter user X subscribe to?”
- user is a user_id or screen_name
- n has a maximum of 1,000
usr_List_Subs = lists_subscriptions(
user = "<usr_name>",
n = 400,
parse= TRUE
)
usr_List_Subs %>% view()
Tidy Text
Tidy Twitter Text
You have searched Twitter for a hashtag, saved the data, and loaded it; this is where you can get clean Twitter text by using the plain_tweets() function. There are two steps to get the clean text.
# step 1
twitter_df_text = searched_tweets$text
# step 2
clean_tweets_text = plain_tweets( twitter_df_text )
clean_tweets_text
Twitter Stop Words
Returns rtweet’s dataframe of Twitter stop words.
- stopwordslangs has about 24,000 rows
- words are associated with 10 different languages: c("ar", "en", "es", "fr", "in", "ja", "pt", "ru", "tr", "und")
- variables:
  - word: a potential stop word
  - lang: a 2- or 3-letter language code
  - p: a probability value associated with frequency; higher values mean the word occurs more frequently (and vice versa)
head(stopwordslangs)
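As one way to put stopwordslangs to use, here is a sketch assuming the tidytext package is installed and the clean_tweets_text vector from the Tidy Twitter Text step above is available:

```r
library(dplyr)
library(tidytext)
library(rtweet)

# keep only high-probability English stop words
en_stops <- stopwordslangs %>%
  filter(lang == "en", p > 0.98) %>%
  select(word)

# tokenize the cleaned tweet text and drop the stop words
tweet_words <- tibble(text = clean_tweets_text) %>%
  unnest_tokens(word, text) %>%
  anti_join(en_stops, by = "word")

# most frequent remaining words
tweet_words %>% count(word, sort = TRUE)
```

The p threshold (0.98 here) is a judgment call; lower it to filter more aggressively.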
Post
Post Direct Message
Posts a direct message to the specified user.
- use screen_name or user_id to target the message
Send a message from RStudio to a user!
Note: assigning the result to a variable still runs the function, and it lets you see all the information returned about the direct message.
# ===== post a direct message to user
DM_message = post_message(
"Hi from RStudio {rtweet} message #1",
user = "<usr_name>",
media = NULL
)
Post a Tweet
Posts a tweet from your account.
Tweet from RStudio!
- status (the tweet) must be <= 280 characters
- media is the file path of an image or video to include in the tweet
- destroy_id is used to delete a tweet; you need to provide a single status_id integer value for it to work
post_tweet(
status = "my 1st {rtweet} #rstats from Rstudio",
media = NULL,
token = NULL,
in_reply_to_status_id = NULL,
destroy_id = NULL,
retweet_id = NULL,
auto_populate_reply_metadata = FALSE
)
Post a Tweet to a Thread 🧵
This uses get_timeline(), which returns up to 3,200 tweets posted to a timeline by one or more Twitter users.
- user or user_id can be used
- the parse argument is meant to save you anger and frustration when using this data; be kind to yourself and always set it to TRUE, as it returns a parsed dataframe
##------ lookup status_id for my own timeline
my_timeline <- get_timeline(rtweet:::home_user())
my_timeline
##------ ID for reply, slice the first one (latest tweet) to get status_id integer
reply_id <- my_timeline$status_id[1]
reply_id
##------ post reply
post_tweet("second in the thread {rtweet}",
in_reply_to_status_id = reply_id)
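To extend the thread further, the same pattern can be looped. This is a sketch using only functions shown above; the tweet texts are placeholders:

```r
# post a chain of replies, each targeting the tweet just posted
thread <- c("third in the thread {rtweet}", "fourth in the thread {rtweet}")

last_id <- reply_id  # status_id of the latest tweet in the thread
for (txt in thread) {
  post_tweet(txt,
             in_reply_to_status_id = last_id,
             auto_populate_reply_metadata = TRUE)
  # re-fetch our newest tweet to get its status_id for the next reply
  last_id <- get_timeline(rtweet:::home_user(), n = 1)$status_id[1]
}
```

auto_populate_reply_metadata keeps the @mention chain intact the way the Twitter app does for threads.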
Plots
#TidyTuesday plot
You can post your #TidyTuesday plots from your RStudio or any generated plots. This is an example from the documentation.
##---------- generate data to make/save plot (as a .png file)
x <- rnorm(300)
y <- x + rnorm(300, 0, .75)
col <- c(rep("#002244aa", 50), rep("#440000aa", 50))
bg <- c(rep("#6699ffaa", 50), rep("#dd6666aa", 50))
##--------- create temporary file name
tmp <- tempfile(fileext = ".png")
##-------- save as png
png(tmp, 6, 6, "in", res = 127.5)
par(tcl = -.15, family = "Inconsolata",
font.main = 2, bty = "n", xaxt = "l", yaxt = "l",
bg = "#f0f0f0", mar = c(3, 3, 2, 1.5))
plot(x, y, xlab = NULL, ylab = NULL, pch = 21, cex = 1,
bg = bg, col = col,
main = "This image was uploaded by rtweet")
grid(8, lwd = .15, lty = 2, col = "#00000088")
dev.off()
##------- post tweet with media attachment
post_tweet("a tweet with media attachment {rtweet}", media = tmp)
Time series plot 1
Returns a ggplot2 time-interval plot based on Twitter data. This is an example of searching Twitter for the trending #ClimateEmergency.
- by takes secs, mins, hours, days, months, or years; when an integer is given, the unit defaults to seconds
#--------- search for the #ClimateEmergency
# ClimateEmerg = search_tweets2(
# "ClimateEmergency",
# n = 10000
# )
ClimateEmerg %>% head()
#-- time series plot
ClimateEmerg_freq = ts_plot(ClimateEmerg, by="mins")
ClimateEmerg_freq
ClimateEmerg %>%
group_by(is_retweet) %>%
ts_plot("hours")
# Compare tweets by retweet or not
Time Series plot 1.1
Extract tweets from a users data object (parsed data). Use tweets_data() to return a dataframe.
ClimateEmerg_users = tweets_data( users = ClimateEmerg )
ClimateEmerg_users
Parse the data into dataframes/tibbles with tweets_with_users():
tweets.and.users= tweets_with_users(ClimateEmerg)
tweets.and.users
Time series plot 2
Using the searched-tweets dataframe (from a previously run Twitter search) we can use the time-series plotting function to generate a plot. ts_plot() makes a time-series frequency plot; you can use ggplot2 alone if you like, or in conjunction with ts_plot().
# Twitter search: 'rstats' OR 'RStats' tweets by minute
ts_plot(rstats_searched_tweets,    # dataframe of searched tweets
        by = "mins",               # secs | mins | hours | days | weeks | years
        tz = "America/Edmonton",   # your timezone
        trim = 1)                  # trim 1 interval from the start and end of the data
Live Twitter
Returns Twitter data on the specified query for the duration set in the function call.
Returns public tweets, with 4 methods:
- 1 - small random sample of tweets available
- 2 - filtering using search query (<= 400 keywords)
- 3 - tracking vector of user ids (<= 5000 user_ids)
- 4 - geolocation coordinates
This function can be used for trends and to grab users data.
Note:
- the timeout argument can be set to higher values
- the folder generated by rtweet will have an integer string in its name; that is where it stores the JSON file. Moving a copy to the current working directory makes the file easier to use.
This data-stream collection is on the trend #ADayOffTwitch, which was a protest against online hate and harassment.
Twitter_LiveStream = stream_tweets2(
"ADayOffTwitch",
timeout = 90, # 30 sec is default
parse = TRUE,
verbose = TRUE,
file_name = "TwitterLiveTweets",
append = TRUE
# default is FALSE which overwrites pre-existing data
)
#-- parse
Twitter_LiveStream = parse_stream('TwitterLiveTweets.json')
#-- users data into dataframe
twitch_tweet_users = users_data(Twitter_LiveStream)
twitch_tweet_users %>% view()
# time series plot of tweets based on seconds
ts_plot( Twitter_LiveStream, "secs")