VSP on Tweets
Introduction
This is a hands-on example of the vsp library. A sampled Twitter dataset of tweets under #coronavirus is used for demonstration.
Methods
Libraries
Required libraries are dplyr, vsp, tidytext, purrr, and blogdown. Of these, vsp can be installed from GitHub through devtools.
#load libraries
library(dplyr)
#devtools::install_github("RoheLab/vsp")
library(vsp)
library(tidytext)
library(purrr) #for map function
library(blogdown) #for shortcodes in R Markdown
Data Source & Data Description
Tweets were collected through the Twitter API, and the Twitter accounts were further divided into flocks, using the same categorization as the Murmuration website. The dataset can be downloaded from here. Throughout, we name the dataset “covid”.
Data description:
- Date: YYYY/MM/DD. Collected from February 1st to March 31st.
- user_id: the ID of the user that tweeted
- status_id: the ID of the tweet post
- screen_name: the @name shown to the public (e.g. @realdonaldtrump’s screen name is realdonaldtrump)
- flock_category: indicates the category of account clusters. There are 6 categories: liberals, conservatives, media, issue-centric, pop culture, academia
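As a concrete picture of the structure described above, here is a toy data frame with the same columns (plus the text column used later in the pipeline). All values are fabricated for illustration; they are not real tweets:

```r
# Toy illustration of the dataset's columns; values are fabricated
covid_toy <- data.frame(
  Date           = c("2020/02/01", "2020/02/02"),
  user_id        = c("1001", "1002"),
  status_id      = c("50001", "50002"),
  screen_name    = c("example_user_a", "example_user_b"),
  flock_category = c("liberals", "media"),
  text           = c("staying home today #coronavirus",
                     "live updates on #coronavirus"),
  stringsAsFactors = FALSE
)
str(covid_toy)
```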
#top several lines of the data-set
head(covid)
Data Cleaning
1. For convenience, split the data by flock, and store it as a list:
covid_flockls <- split(covid, f = covid$flock_category)[1:6]
2. Extract the tweet text for each flock from the list generated in step 1:
covid_textls <- lapply(covid_flockls, function(x) {
tibble(tweet = 1:nrow(x), text = x$text)
})
3. Further, for each flock, unnest the tweet text into tokens:
covid_ttls <- lapply(covid_textls, function(x) {
x %>% unnest_tokens(word, text)
})
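To see what unnest_tokens() is doing here, a rough base-R sketch looks like the following. This ignores tidytext's exact punctuation and token rules, and tokenize is a hypothetical helper, not part of the pipeline above:

```r
# Rough base-R sketch of unnest_tokens(word, text): lowercase each
# tweet and emit one row per word (tidytext's real rules differ slightly)
tokenize <- function(df) {
  words <- strsplit(tolower(df$text), "[^a-z0-9#@']+")
  data.frame(
    tweet = rep(df$tweet, lengths(words)),
    word  = unlist(words),
    stringsAsFactors = FALSE
  )
}
toy <- data.frame(tweet = 1, text = "Stay home #coronavirus",
                  stringsAsFactors = FALSE)
tokenize(toy)
```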
#first several lines of the liberal flock
head(covid_ttls[['liberals']])
Run VSP
1. Make sparse matrix for each flock:
covid_matrix <- lapply(covid_ttls, function(x) {
  cast_sparse(x, tweet, word)
})
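To illustrate the kind of object cast_sparse() returns, here is a toy tweet-by-word sparse count matrix built with the Matrix package (which ships with R); the tokens are invented for illustration:

```r
library(Matrix)

# Toy sketch of the structure cast_sparse() returns: a sparse
# tweet-by-word count matrix (tokens invented for illustration)
tokens <- data.frame(
  tweet = c(1, 1, 2, 2, 2),
  word  = c("stay", "home", "stay", "safe", "safe")
)
m <- xtabs(~ tweet + word, data = tokens, sparse = TRUE)
m  # 2 tweets x 3 distinct words; tweet 2 uses "safe" twice
```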
#first 20 rows and columns of the liberal sparse matrix
covid_matrix[['liberals']][1:20, 1:20]
2. Run VSP on each flock (categorize tweets into 15 topics, k = 15). We can use a screeplot to visualize the singular values:
covid_fa <- lapply(covid_matrix, function(x) {
  vsp(x, k = 15)
})
#screeplot for the liberals flock
screeplot(covid_fa[['liberals']])
#use pair plot to plot the factors (diagnostic measure)
plot_varimax_z_pairs(covid_fa[['liberals']], 1:5)
3. For each flock, use the VSP result to find, for each topic (i.e. each column of the matrix), the 10 rows with the highest scores. Entry (i, j) of the matrix Z measures how strongly tweet i loads on topic j:
topTweets <- 10
topid_ls <- lapply(covid_fa, function(x) {
  apply(x$Z, 2, function(t)
    which(rank(-t, ties.method = "random") <= topTweets))
})
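The rank(-t, ...) idiom above gives rank 1 to the largest score, and random tie-breaking guarantees exactly topTweets indices are returned even when scores tie. A self-contained example:

```r
# rank(-scores) assigns rank 1 to the largest score; random
# tie-breaking still returns exactly k indices when scores tie
scores <- c(0.9, 0.1, 0.5, 0.9, 0.3)
top2 <- which(rank(-scores, ties.method = "random") <= 2)
sort(top2)    # indices of the two largest (tied) scores
scores[top2]
```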
#get the corresponding tweets
toptweet_ls <- lapply(seq_along(topid_ls), function(i) {
df <- topid_ls[[i]]
txtdf <- covid_flockls[[i]]
ls <- list()
for (j in 1:ncol(df)) {
name <- paste("topic", j)
ls[[j]] <- tibble(
tweet = txtdf$text[df[, j]],
status_id = txtdf$status_id[df[, j]],
screen_name = txtdf$screen_name[df[, j]]
)
names(ls)[j] <- name
}
return(ls)
})
names(toptweet_ls) <- names(topid_ls)
4. To embed the tweets into a website, you can mutate columns for the tweet-post URLs:
get_embed <- function(status_id) {
  #the oEmbed endpoint resolves the tweet by its status ID, so the
  #screen name hardcoded in the URL (here "Interior") need not match
  api_result <- httr::GET(paste0(
    "https://publish.twitter.com/oembed?url=https%3A%2F%2Ftwitter.com%2FInterior%2Fstatus%2F",
    status_id
  ))
  api_content <- httr::content(api_result)
  html_content <- api_content[["html"]]
  return(html_content)
}
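The percent-encoded prefix above can also be built explicitly rather than hardcoded. build_oembed_url is a hypothetical helper sketching this with utils::URLencode; it only constructs the request URL and makes no network call:

```r
# Hypothetical helper: build the oEmbed request URL explicitly
# instead of hardcoding the percent-encoded prefix
build_oembed_url <- function(screen_name, status_id) {
  tweet_url <- paste0("https://twitter.com/", screen_name,
                      "/status/", status_id)
  paste0("https://publish.twitter.com/oembed?url=",
         utils::URLencode(tweet_url, reserved = TRUE))
}
build_oembed_url("Interior", "1234567890")
```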
toptweet_ls_withlink <- lapply(toptweet_ls, function(ls) {
lapply(ls, function(topic) {
topic %>%
mutate(status_url = paste0(
"https://twitter.com/", screen_name, "/status/", status_id
)) %>%
mutate(embed_url = map(status_id, get_embed))
})
})
#example of top tweets for topic 10 in flock liberals
toptweet_ls_withlink[['liberals']][['topic 10']]
5. Finally, use shortcodes to display the Twitter-post HTML widgets:
shortcodes("tweet",
toptweet_ls_withlink[['liberals']][['topic 10']]$status_id)