VSP on Tweets
Introduction
This is a hands-on example of the vsp library. A sampled Twitter dataset of tweets under #coronavirus is used for demonstration.
Methods
Libraries
Required libraries are dplyr, vsp, tidytext, purrr, and blogdown. Of these, vsp can be installed from GitHub through devtools.
#load libraries
library(dplyr)
#devtools::install_github("RoheLab/vsp")
library(vsp)
library(tidytext)
library(purrr) #for map function
library(blogdown) #for shortcodes in R Markdown
Data Source & Data Description
Tweets were collected through the Twitter API, and the Twitter accounts were further divided into flocks, using the same categorization as the Murmuration website. The dataset can be downloaded from here. Throughout, we name the dataset “covid”.
Data description:
- Date: YYYY/MM/DD. Collected from February 1st to March 31st.
- user_id: the ID of the user that tweeted
- status_id: the ID of the tweet post
- screen_name: the @name shown to the public (e.g. @realdonaldtrump’s screen name is realdonaldtrump)
- flock_category: indicates the category of account clusters. There are 6 categories: liberals, conservatives, media, issue-centric, pop culture, academia
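As a concrete picture of the structure described above, here is a toy data frame with the same columns (plus the text column used later in the pipeline). All values are fabricated for illustration; they are not real tweets:

```r
# Toy illustration of the dataset's columns; values are fabricated
covid_toy <- data.frame(
  Date           = c("2020/02/01", "2020/02/02"),
  user_id        = c("1001", "1002"),
  status_id      = c("50001", "50002"),
  screen_name    = c("example_user_a", "example_user_b"),
  flock_category = c("liberals", "media"),
  text           = c("staying home today #coronavirus",
                     "live updates on #coronavirus"),
  stringsAsFactors = FALSE
)
str(covid_toy)
```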
#top several lines of the data-set
head(covid)
Data Cleaning
1. For convenience, split the data by flock, and store it as a list:
covid_flockls <- split(covid, f = covid$flock_category)[1:6]
2. Extract the tweet text for each flock from the list generated in step 1:
covid_textls <- lapply(covid_flockls, function(x) {
tibble(tweet = 1:nrow(x), text = x$text)
})
3. Further, for each flock, unnest the tweet text into tokens:
covid_ttls <- lapply(covid_textls, function(x) {
x %>% unnest_tokens(word, text)
})
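To see what unnest_tokens() is doing here, a rough base-R sketch looks like the following. This ignores tidytext's exact punctuation and token rules, and tokenize is a hypothetical helper, not part of the pipeline above:

```r
# Rough base-R sketch of unnest_tokens(word, text): lowercase each
# tweet and emit one row per word (tidytext's real rules differ slightly)
tokenize <- function(df) {
  words <- strsplit(tolower(df$text), "[^a-z0-9#@']+")
  data.frame(
    tweet = rep(df$tweet, lengths(words)),
    word  = unlist(words),
    stringsAsFactors = FALSE
  )
}
toy <- data.frame(tweet = 1, text = "Stay home #coronavirus",
                  stringsAsFactors = FALSE)
tokenize(toy)
```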
#first several lines of the liberal flock
head(covid_ttls[['liberals']])
Run VSP
1. Make sparse matrix for each flock:
covid_matrix <- lapply(covid_ttls, function(x) {
  cast_sparse(x, tweet, word)
})
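To illustrate the kind of object cast_sparse() returns, here is a toy tweet-by-word sparse count matrix built with the Matrix package (which ships with R); the tokens are invented for illustration:

```r
library(Matrix)

# Toy sketch of the structure cast_sparse() returns: a sparse
# tweet-by-word count matrix (tokens invented for illustration)
tokens <- data.frame(
  tweet = c(1, 1, 2, 2, 2),
  word  = c("stay", "home", "stay", "safe", "safe")
)
m <- xtabs(~ tweet + word, data = tokens, sparse = TRUE)
m  # 2 tweets x 3 distinct words; tweet 2 uses "safe" twice
```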
#first 20 rows and columns of the liberal sparse matrix
covid_matrix[['liberals']][1:20, 1:20]
2. Run VSP on each flock (categorize tweets into 15 topics, k = 15). We can use a screeplot to visualize the singular values:
covid_fa <- lapply(covid_matrix, function(x) {
  vsp(x, k = 15)
})
#screeplot for the liberals flock
screeplot(covid_fa[['liberals']])
#use pair plot to plot the factors (diagnostic measure)
plot_varimax_z_pairs(covid_fa[['liberals']], 1:5)
3. For each flock, use the VSP result to find, for each topic (i.e. each column of the matrix), the 10 rows with the highest scores. Entry (i, j) of the matrix Z measures how strongly tweet i loads on topic j:
topTweets <- 10
topid_ls <- lapply(covid_fa, function(x) {
  apply(x$Z, 2, function(t)
    which(rank(-t, ties.method = "random") <= topTweets))
})
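The rank(-t, ...) idiom above gives rank 1 to the largest score, and random tie-breaking guarantees exactly topTweets indices are returned even when scores tie. A self-contained example:

```r
# rank(-scores) assigns rank 1 to the largest score; random
# tie-breaking still returns exactly k indices when scores tie
scores <- c(0.9, 0.1, 0.5, 0.9, 0.3)
top2 <- which(rank(-scores, ties.method = "random") <= 2)
sort(top2)    # indices of the two largest (tied) scores
scores[top2]
```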
#get the corresponding tweets
toptweet_ls <- lapply(seq_along(topid_ls), function(i) {
df <- topid_ls[[i]]
txtdf <- covid_flockls[[i]]
ls <- list()
for (j in 1:ncol(df)) {
name <- paste("topic", j)
ls[[j]] <- tibble(
tweet = txtdf$text[df[, j]],
status_id = txtdf$status_id[df[, j]],
screen_name = txtdf$screen_name[df[, j]]
)
names(ls)[j] <- name
}
return(ls)
})
names(toptweet_ls) <- names(topid_ls)
4. To embed the tweets into a website, you can mutate columns for the tweet-post URLs:
get_embed <- function(status_id) {
  #the oEmbed endpoint resolves the tweet by its status ID, so the
  #screen name hardcoded in the URL (here "Interior") need not match
  api_result <- httr::GET(paste0(
    "https://publish.twitter.com/oembed?url=https%3A%2F%2Ftwitter.com%2FInterior%2Fstatus%2F",
    status_id
  ))
  api_content <- httr::content(api_result)
  html_content <- api_content[["html"]]
  return(html_content)
}
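The percent-encoded prefix above can also be built explicitly rather than hardcoded. build_oembed_url is a hypothetical helper sketching this with utils::URLencode; it only constructs the request URL and makes no network call:

```r
# Hypothetical helper: build the oEmbed request URL explicitly
# instead of hardcoding the percent-encoded prefix
build_oembed_url <- function(screen_name, status_id) {
  tweet_url <- paste0("https://twitter.com/", screen_name,
                      "/status/", status_id)
  paste0("https://publish.twitter.com/oembed?url=",
         utils::URLencode(tweet_url, reserved = TRUE))
}
build_oembed_url("Interior", "1234567890")
```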
toptweet_ls_withlink <- lapply(toptweet_ls, function(ls) {
lapply(ls, function(topic) {
topic %>%
mutate(status_url = paste0(
"https://twitter.com/", screen_name, "/status/", status_id
)) %>%
mutate(embed_url = map(status_id, get_embed))
})
})
#example of top tweets for topic 10 in flock liberals
toptweet_ls_withlink[['liberals']][['topic 10']]
5. Finally, use shortcodes to display the Twitter-post HTML widgets:
shortcodes("tweet",
toptweet_ls_withlink[['liberals']][['topic 10']]$status_id)