An Intro to Sentiment Analysis in R — How Does Twitter Feel about Baker Mayfield? (2024)

Analyze audience sentiment using Twitter developer APIs and R libraries

Published in Towards Data Science

Data Pulling with Twitter API

First, we need to pull in a few R libraries to help with our text cleaning, visualization, and sentiment analysis.

library(rtweet)
library(stopwords) 
library(dplyr) 
library(tidyr) 
library(tidytext) 
library(wordcloud)
library(devtools)
library(tidyverse) 
library(stringr)
library(textdata)

There are a variety of options available for pulling social media data, such as Netlytic (a cloud-based text analyzer and social network visualization tool with data export) and the Twitter developer APIs. For our use case, we will use the Twitter API to pull tweets containing a variety of keywords. (Note: this API only lets us pull data from the previous 6–9 days.)

You can read more about creating a developer account, using the Twitter API, and creating access tokens for your use here: https://cran.r-project.org/web/packages/rtweet/vignettes/auth.html. Once this is done, you can access more info from your developer portal.

Once you have your app name, consumer keys, and access tokens, you can use create_token() to generate an authorization token so that you can pull tweets in the steps that follow.

create_token(app = app_name, 
 consumer_key = consumer_key, 
 consumer_secret = consumer_secret, 
 access_token = access_token, 
 access_secret = access_secret)

Now you’re all set to begin pulling your data! Using search_tweets() from the rtweet library, we can search on a few keywords. By joining them with the boolean operator OR, we can search for any tweets containing at least one of our search terms. By default, this returns only 100 tweets, with a maximum of 18,000 per request. However, you can return more tweets by setting the retryonratelimit parameter to TRUE. How many tweets you can pull in total depends on your account access level, so I’d recommend using this only with higher access levels. The default is to return recent tweets, but there are also options to return the most popular tweets or a mix of recent and popular. For the purpose of general sentiment analysis, and to save time, I have opted not to include retweets.

More info on query formatting, flexibility, and options is available here: https://www.rdocumentation.org/packages/rtweet/versions/0.7.0/topics/search_tweets

tweets_data <- search_tweets('Baker Mayfield OR Mayfield OR Cleveland Browns OR Stefanski', include_rts = FALSE, lang = 'en', n = 18000)
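One caveat worth checking against the rtweet documentation linked above: spaces inside a query act as a boolean AND, and exact phrases need to be wrapped in double quotes. The sketch below is not the query used for the results in this article. It only illustrates one way to build an OR query from a vector of terms so that multi-word phrases are quoted. The helper quote_phrase() and the variable names are hypothetical.

```r
# Hypothetical helper: wrap multi-word terms in double quotes so they are
# treated as exact phrases; single words are returned unchanged.
quote_phrase <- function(term) {
  if (grepl(" ", term, fixed = TRUE)) paste0('"', term, '"') else term
}

search_terms <- c("Baker Mayfield", "Mayfield", "Cleveland Browns", "Stefanski")

# Join the (possibly quoted) terms with OR to match tweets containing any of them.
query <- paste(vapply(search_terms, quote_phrase, character(1)), collapse = " OR ")
cat(query, "\n")
```

The resulting string could then be passed as the first argument to search_tweets() in place of the hard-coded query.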

We can then check how many tweets were returned.

nrow(tweets_data)

The Twitter API returns a lot of detailed information about each tweet, including hashtags, URLs, etc. For our analysis, we are only interested in the first 16 columns, which range from user_id to reply_count.

tweets_data <- tweets_data[,1:16]
summary(tweets_data)

Now, we can take a look at a few of the Tweets and their content…

head(tweets_data$text)

We can also check which date had the most tweets on our chosen topics.
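The code behind this step is not shown in the article, so the following is only a minimal sketch of one way to count tweets per day with dplyr. It assumes the created_at timestamp column that rtweet returns. The toy_tweets data frame and its values are made up purely so the example runs on its own.

```r
library(dplyr)

# Made-up stand-in for tweets_data; only the created_at column matters here.
toy_tweets <- data.frame(
  created_at = as.POSIXct(
    c("2022-01-03 21:15:00", "2022-01-04 09:00:00",
      "2022-01-04 13:30:00", "2022-01-05 08:45:00"),
    tz = "UTC"
  )
)

# Reduce each timestamp to its calendar date, then count tweets per date,
# with the busiest date first.
tweets_per_day <- toy_tweets %>%
  mutate(tweet_date = as.Date(created_at)) %>%
  count(tweet_date, sort = TRUE)

print(tweets_per_day)
```

With the real data, toy_tweets would be replaced by tweets_data, and the resulting counts could be passed to a bar chart.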

We can now see that the overwhelming majority of tweets occurred on Tuesday, January 4th, the day after the Browns vs. Steelers game (Mayfield’s last game of the season). This is important to keep in mind as we continue with our analysis, as it likely influences the sentiments in our data.

Some Initial Exploration and Data Preprocessing

We can use the unnest_tokens() function from the tidytext library to expand our tweets into individual words, format the words to lowercase, and remove any punctuation, then filter out unneeded words (the, to, and, is, etc.) for our analysis using predefined stopwords.

words_data <- tweets_data %>% select(text) %>%
 unnest_tokens(word, text)

words_data %>% count(word, sort = TRUE)

Before removing stop words with anti_join(stop_words), we can see that a few of the most common tokens are not meaningful words, such as https and t.co (fragments of URLs). We filter those out first, then remove stop words and examine the counts once more.

words_data <- words_data %>% filter(!word %in% c('https', 't.co', 'he\'s', 'i\'m', 'it\'s'))

words_data2 <- words_data %>%
 anti_join(stop_words) %>%
 count(word, sort = TRUE)

head(words_data2, n = 10)

Now, we can see our words are a bit cleaner, and we can examine our cleaned data in a word cloud prior to examining sentiments.
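The word cloud code itself is not shown in the article. As a rough sketch, a cloud like this can be drawn with wordcloud() from the wordcloud library, given a vector of words and their frequencies. The toy_counts data frame below is invented for illustration; with the real data you would use the word and n columns of words_data2 instead.

```r
library(wordcloud)

# Invented word counts standing in for words_data2 (columns: word, n).
toy_counts <- data.frame(
  word = c("browns", "mayfield", "baker", "season", "game"),
  n = c(120, 95, 80, 40, 35),
  stringsAsFactors = FALSE
)

set.seed(42)  # word placement is random, so fix the seed for a repeatable layout
wordcloud(
  words = toy_counts$word,
  freq = toy_counts$n,
  min.freq = 1,         # keep even rare words in this tiny example
  max.words = 50,       # cap the number of words drawn
  random.order = FALSE  # place the most frequent words in the center
)
```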

Sentiment Analysis at Word Level Using Bing Lexicon

words_data2 %>%
 inner_join(get_sentiments("bing")) %>%
 count(sentiment, sort = TRUE)

We can see here that the majority of words are considered negative. If we want to get a sense of which words in our data are being categorized as positive or negative, we can take a peek using a comparison word cloud (and exclude any profanity using a word list from the lexicon package, which is installed alongside sentimentr).

profanity_list <- unique(tolower(lexicon::profanity_alvarez))

# acast() comes from the reshape2 package, which is not loaded above
words_data %>%
 filter(!word %in% c('https', 't.co', 'he\'s', 'i\'m', 'it\'s', profanity_list)) %>%
 inner_join(get_sentiments("bing")) %>%
 count(word, sentiment, sort = TRUE) %>%
 reshape2::acast(word ~ sentiment, value.var = "n", fill = 0) %>%
 comparison.cloud(colors = c("red", "blue"),
 max.words = 50)

This now gives us a deeper glimpse into our categories. We can see words like “better”, “fans”, “win”, and “best” are positive, while words like “offensive”, “injury”, “bad”, or “issues” are negative. However, the Bing lexicon assigns each word a fixed label, while the sentiment a word actually carries can depend on context. The positively classified word “progressive” is likely referencing the quarterback’s series of Progressive commercials, while the negatively classified “loss” may simply be referencing the Browns’ loss to the Steelers. Without further examination, these words could be misclassified, since their meaning depends on the context of the full tweet or sentence.
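One possible way to handle such context-dependent words, shown here only as a sketch and not as part of the original analysis, is to drop a hand-picked list of domain-specific words before joining to the lexicon. The domain_words vector and the toy_words data frame are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens standing in for words_data.
toy_words <- data.frame(
  word = c("progressive", "loss", "win", "bad", "injury"),
  stringsAsFactors = FALSE
)

# Words whose lexicon label is probably misleading in this football context
# (an assumption made for illustration).
domain_words <- c("progressive", "loss")

scored <- toy_words %>%
  filter(!word %in% domain_words) %>%               # remove the ambiguous words
  inner_join(get_sentiments("bing"), by = "word")   # label the rest with Bing

print(scored)
```

The trade-off is that the exclusion list is a manual judgment call, so it is worth reviewing the most frequent labeled words before deciding what to drop.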

Sentiment Analysis at Full Tweet Level Using Sentimentr

Using the sentimentr library, we can analyze full tweets and examine a meanSentiment score instead of word-by-word classification.

library(sentimentr)
tweet_sentences_data <- sentiment(get_sentences(tweets_data$text)) %>%
 group_by(element_id) %>%
 summarize(meanSentiment = mean(sentiment))

head(tweet_sentences_data)

The meanSentiment tells us how positive or negative the sentiment is. A positive number indicates positive sentiment and a negative number indicates negative sentiment. If it is 0, the tweet is simply neutral.
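To make the aggregation concrete, here is a small sketch with invented sentence-level scores. Each tweet (element_id) can contain several sentences, so the scores are averaged per tweet and then labeled by sign. The toy_scores values and the label column are illustrative only and are not output from sentimentr.

```r
library(dplyr)

# Invented sentence-level scores: tweet 1 has two sentences, tweet 2 has one,
# tweet 3 has two.
toy_scores <- data.frame(
  element_id = c(1, 1, 2, 3, 3),
  sentiment = c(0.5, -0.1, 0, -0.4, -0.2)
)

tweet_level <- toy_scores %>%
  group_by(element_id) %>%
  summarize(meanSentiment = mean(sentiment)) %>%
  mutate(label = case_when(
    meanSentiment > 0 ~ "positive",
    meanSentiment < 0 ~ "negative",
    TRUE ~ "neutral"
  ))

print(tweet_level)
```

Note that a tweet with one positive and one negative sentence can average out close to 0, so a near-neutral score does not always mean neutral wording.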

We can also observe how positive the most positive tweet is versus how negative the most negative tweet is, and we can get a count within each group. With these counts, we can then visualize the balance of sentiment across our data (who doesn’t love a good viz?).

print(paste0("Most negative tweets sentiment: ", min(tweet_sentences_data$meanSentiment)))
print(paste0("Most positive tweets sentiment: ", max(tweet_sentences_data$meanSentiment)))

print(paste0("# of Negative Tweets: ", sum(tweet_sentences_data$meanSentiment < 0)))
print(paste0("# of Neutral Tweets: ", sum(tweet_sentences_data$meanSentiment == 0)))
print(paste0("# of Positive Tweets: ", sum(tweet_sentences_data$meanSentiment > 0)))

We can see that our most negative tweet is in fact very negative, with a score near -1, and the most positive tweet is correspondingly strongly positive. For a more powerful display of our findings, we can create a visualization showing the balance of each sentiment using our sentiment counts.

slices <- c(sum(tweet_sentences_data$meanSentiment < 0), sum(tweet_sentences_data$meanSentiment == 0),
 sum(tweet_sentences_data$meanSentiment > 0))
labels <- c("Negative Tweets: ", "Neutral Tweets: ", "Positive Tweets: ")

pct <- round(slices/sum(slices)*100)
labels <- paste(labels, pct, "%", sep = "") #customize labeling

#add in appropriate colors for negative, neutral, positive
pie(slices, labels = labels, col=c('red', 'yellow', 'green'),
 main="Tweet Sentiment Percentages")

At the tweet level, we can see that sentiment across the tweets we pulled is much more balanced than at the word level.

Sentiment Analysis at User Level

Another interesting extension of this analysis is to show sentiment per user, as some users may have multiple tweets that differ in sentiment. However, we have over 4,000 users in our dataset. For a cleaner visual and easier initial exploration, we will limit our data to the 50 most-favorited tweets and their respective users.

n_distinct(tweets_data$user_id)

#selecting top 50 tweets by favorites
user_sentiment <- tweets_data %>% select(user_id, text, favorite_count) %>% arrange(desc(favorite_count)) %>% slice(1:50)
head(user_sentiment)

Now that our data is ordered by descending favorite count and limited to the top 50 tweets, we can group sentiment by user and get a better understanding of these users’ sentiments using sentiment_by() from the sentimentr library.

out <- sentiment_by(get_sentences(user_sentiment$text),
 list(user_sentiment$user_id))

plot(out)

This gives us a better understanding of sentiment per user: some users have a broad range of sentiment scores across their tweets, while others are completely neutral (perhaps with just one tweet, or several completely neutral ones).
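If a table is easier to scan than a plot, the same idea can be sketched with plain dplyr: summarize each user’s scores by count, mean, and spread. This is an illustrative alternative, not what sentiment_by() computes internally, and the toy_user_scores data frame is invented.

```r
library(dplyr)

# Invented per-tweet sentiment scores for three hypothetical users.
toy_user_scores <- data.frame(
  user_id = c("u1", "u1", "u2", "u3", "u3"),
  sentiment = c(0.6, -0.4, 0, 0.1, 0.3),
  stringsAsFactors = FALSE
)

user_summary <- toy_user_scores %>%
  group_by(user_id) %>%
  summarize(
    n_tweets = n(),
    mean_sentiment = mean(sentiment),
    spread = max(sentiment) - min(sentiment)  # 0 for single-tweet users
  ) %>%
  arrange(desc(spread))  # users with the widest range of sentiment first

print(user_summary)
```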

Potential Expansions on Sports Sentiment Analysis Using Social Media Data

There are many additional applications for sports sentiment analysis in this context, and plenty of room to expand the analysis further. It may be interesting to gather tweets directly from popular sports analysts, perform sentiment analysis on what they post about each team, and look for any favoritism or bias. It may also be interesting to perform sentiment analysis across a full season and gauge how sentiment changes with each game or with organizational changes. The possibilities are endless, so take some time to explore if you are interested!