VSP on tweet

Introduction

This is a hands-on example of vsp library. A sampled twitter data-set of tweets under #coronavirus was used for demonstration.

Methods

Libraries

Required libraries are dplyr, vsp, tidytext, purrr, and blogdown. The latter can be installed through devtools.

#load libraries
library(dplyr)
#devtools::install_github("RoheLab/vsp")
library(vsp)
library(tidytext)
library(purrr)#for map function
library(blogdown)#for shortcodes in rmarkdown

Data Source & Data description

Tweets were collected from Twitter API, and the twitter accounts were further divided into flocks, which is same as the murmuration website. The data-set can be downloaded from here. Here we name the data-set “covid”.

Data description:
- Date: YYYY/MM/DD. The data was collected from Feb 1st to March 31st.
- user_id: the id of the user that tweeted
- status_id: the id of the tweet post
- screen_name: the @name that is shown to the public. eg. @realdonaldtrump’s screen name is realdonaldtrump.
- flock_category: indicates the category of account clusters. There are 6 categories: liberals, conservatives, media, issue-centric, pop culture, academia

#top several lines of the data-set
head(covid)
## # A tibble: 6 x 6
##   date       status_id   user_id   screen_name  text              flock_category
##   <date>     <chr>       <chr>     <chr>        <chr>             <fct>         
## 1 2020-02-01 1223721734… 32580016… BernieSande… "vermont sen. be… liberals      
## 2 2020-02-01 1223649688… 15150904  techweenie   "@amberd1116 @te… liberals      
## 3 2020-02-01 1223647866… 87096872… tweets4poppy "tom's patriot c… conservatives 
## 4 2020-02-01 1223719366… 89614375… CycleOfChaos "god bless this … conservatives 
## 5 2020-02-01 1223584239… 93815260… HLAurora63   "@twitter @jack … conservatives 
## 6 2020-02-01 1223611910… 87504275  rightlyalig… "wuhan doctors b… conservatives

Data Cleaning

  1. For convenience, split the data by flock, and store it as a list
covid_flockls<-split(covid,f=covid$flock_category)[1:6]
  1. Extract the tweet text for each flock from the list generated by step 1
covid_textls<-lapply(covid_flockls,function(x){
    tibble(tweet=1:nrow(x), text=x$text)
})
  1. Further, for each flock, unnest the tweet text into tokens
covid_ttls<-lapply(covid_textls, function(x){
    x %>% unnest_tokens(word, text)
})

#first several lines of the liberal flock, showing the word tokens in the #1 tweet
head(covid_ttls[['liberals']])
## # A tibble: 6 x 2
##   tweet word    
##   <int> <chr>   
## 1     1 vermont 
## 2     1 sen     
## 3     1 bernie  
## 4     1 sanders 
## 5     1 campaign
## 6     1 may

Run vsp

  1. Make sparse matrix for each flock
covid_matrix<-lapply(covid_ttls, function(x){
    dt<-cast_sparse(x,tweet,word)
})

#first 20 rows and columns of the liberal sparse matrix, 1 represents the ith and jth word appear together
covid_matrix[['liberals']][1:20,1:20]
## 20 x 20 sparse Matrix of class "dgCMatrix"
##                                           
## 1  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 . . . .
## 2  . . . . . . . . . . 1 . . . . . 1 1 1 1
## 3  . . . . . . . . . . . . . 1 1 . . . . .
## 4  . . . . . . 1 . 1 . 1 . . 1 1 . . . . .
## 5  . . . . . . . . . . 1 . . 1 1 . . . . .
## 6  . . . . . . . . . . 1 . . 1 1 . . . . .
## 7  . . . . . . . . . . 1 . . 1 1 . . . . .
## 8  . . . . . . . . . . 1 . . 1 1 . . . . .
## 9  . . . . . . . . . . . . . 1 1 . . . . .
## 10 . . . . . . . . . . 1 . . 1 1 . . . . .
## 11 . . . . . . . . . . . . . 1 1 . . . . .
## 12 . . . . . . 1 . . . 1 . . . . . . . . .
## 13 . . . . . . . . 1 . 1 . . . . . . . . .
## 14 . . . . . . . . . . 1 . 1 1 1 . . . . .
## 15 . . . . . . 1 . . . 1 . . 1 1 . . . . .
## 16 . . . . . . . . . . 1 . . . . . . . . .
## 17 . . . . . . 1 . . . 1 . . 1 1 . . . . .
## 18 . . . . . . . . . . . . 1 1 1 . . . . .
## 19 . . . . . . . . . . . 1 . 1 1 . . . . .
## 20 . . . . . . . . . . . . . 1 1 . . . . .
  1. Run vsp on each flock(i.e. categorized tweets in each flock into 15 topics, k=15). We can use screeplot function to visualize the singular value.
covid_fa<-lapply(covid_matrix,function(x){
  fa<-vsp(x,k=15)
})
## Warning in n * d: NAs produced by integer overflow

## Warning in n * d: NAs produced by integer overflow

## Warning in n * d: NAs produced by integer overflow

## Warning in n * d: NAs produced by integer overflow

## Warning in n * d: NAs produced by integer overflow
#screeplot for the liberals flock
screeplot(covid_fa[['liberals']])

#use pair plot to plot the factors, serving as a diagnostic measure, here 1:5 means that plot the first 5 factors.
plot_varimax_z_pairs(covid_fa[['liberals']],1:5)

  1. For each flock, according to the vsp result, for each topic(i.e. column in the matrix), find the top 10 rows that have the highest score. (i.e. top 10 tweets that are most correlated with each topic, this will return the row index of the tweets, and we can use these indices to find the correspondent tweets from the tweet-by-flock data, which is the one from Data Cleaning/step1). The matrix Z is the one we will use. It shows how much tweet(row) i belongs to topic(column) j.
#find top 10 tweet ids
topTweets = 10
topid_ls<-lapply(covid_fa, function(x){
  topid<-x$Z %>% apply(2, function(t) which(rank(-t, ties.method = "random") <= topTweets))
})

#get the correspondent tweets
toptweet_ls<-lapply(seq_along(topid_ls), function(i){
  df<-topid_ls[[i]]
  txtdf<-covid_flockls[[i]]
  ls<-list()
  for(j in 1:ncol(df)){
    name<-paste("topic", j)
    ls[[j]]<-tibble(tweet = txtdf$text[df[,j]], status_id=txtdf$status_id[df[,j]],screen_name=txtdf$screen_name[df[,j]])
    names(ls)[j]<-name
  }
  return(ls)
})
#rename elements in toptweet_ls as flock names ("liberals","conservatives",...)
names(toptweet_ls)<-names(topid_ls)
  1. if you want to embed the tweets into websites, you can mutate a column for tweet-post-urls. Or, if you are using short codes, you can skip this step and directly use status id to get the html widgets, just as step 5 shows.
#paste twitter link to each dataframe
#get_embed url of twitter post
get_embed <- function(status_id){
  api_result <- httr::GET(paste0(
    "https://publish.twitter.com/oembed?url=https%3A%2F%2Ftwitter.com%2FInterior%2Fstatus%2F",
    status_id))
  api_content <- httr::content(api_result)
  html_content <- api_content[["html"]]

  return(html_content)
}
toptweet_ls_withlink<-lapply(toptweet_ls,function(ls){
  lapply(ls, function(topic){
    topic%>%
      mutate(status_url=paste0("https://twitter.com/",screen_name,"/status/",status_id))%>%
      mutate(embed_url=map(status_id,get_embed))
  })
})
#example of top tweets for topic 10 in flock liberals
toptweet_ls_withlink[['liberals']][['topic 10']]
## # A tibble: 10 x 5
##    tweet                  status_id   screen_name status_url           embed_url
##    <chr>                  <chr>       <chr>       <chr>                <list>   
##  1 "🚨\U0001f9a0 local us… 1234184131… shawbear76  https://twitter.com… <chr [1]>
##  2 "\U0001f9a0🙏🏾 coronav… 1237756870… veterans_i  https://twitter.com… <chr [1]>
##  3 "@rpdandy @veterans_i… 1238908617… shawbear76  https://twitter.com… <chr [1]>
##  4 "@rpdandy @veterans_i… 1238912196… shawbear76  https://twitter.com… <chr [1]>
##  5 "#lockherup 🤑👎🏽 @sena… 1240962315… veterans_i  https://twitter.com… <chr [1]>
##  6 "😳🤬🤦🏽‍♀️ as the coron…  1243012974… shawbear76  https://twitter.com… <chr [1]>
##  7 "@haldonahue @birdieg… 1243278534… shawbear76  https://twitter.com… <chr [1]>
##  8 "😲\U0001f9a0 infant d… 1244075766… shawbear76  https://twitter.com… <chr [1]>
##  9 "😲\U0001f9a0 infant d… 1244077707… MeAShrew    https://twitter.com… <chr [1]>
## 10 "🤭\U0001f9a0🙏🏾 cnn's … 1245022567… veterans_i  https://twitter.com… <chr [1]>
  1. Now, finally, let’s use short codes to display some of the twitter-post html widgets. Here is the example of tweets under flock liberals, topic 10.
shortcodes("tweet",toptweet_ls_withlink[['liberals']][['topic 10']]$status_id)
Hannah Szu-Han Wang
Hannah Szu-Han Wang
PhD student

My research interests include data science, high-dimensional statistics, sustainable interventions and investments, and human well-being