Sentiment Analysis of Twitter Data (2024)

Twitter Introduction

Recent years have witnessed the rapid growth of social media platforms in which users can publish their individual thoughts and opinions (e.g., Facebook, Twitter, Google+ and several blogs). The rise in popularity of social media has changed the world wide web from a static repository to a dynamic forum for anyone to voice their opinion across the globe. This new dimension of User Generated Content opens up a new and dynamic source of insight to individuals, organizations and governments.

Social network sites, or platforms, are defined as web-based services that allow individuals to:

Construct a public or semi-public profile within a bounded system.
Articulate a list of other users with whom they share a connection.
View and traverse their list of connections and those made by others within the system.

The nature and nomenclature of these connections may vary from site to site.

This package, saotd, focuses on Twitter data because of its widespread global acceptance. Harvested data, analyzed for sentiment, can provide powerful insight into a population. This insight can help organizations better understand their target population. The package allows a user to acquire tweets through the public Twitter Application Programming Interface (API).

The saotd package is broken down into five different phases:

Acquire
Explore
Topic Analysis
Sentiment Calculation
Visualization

The saotd package workflow is shown in the image below, which follows an analysis from the Twitter API through to a complete analysis.

Packages

library(saotd)
library(dplyr)
library(stringr)
library(knitr)

Acquire

To explore the data manipulation functions of saotd we will use the built-in dataset saotd::raw_tweets.

However, if you want to acquire your own tweets, you will first have to:

Create a Twitter account or sign into an existing account.
Use your Twitter login to sign into Twitter Developers.
Navigate to My Applications.
Fill out the new application form.
- You will be asked to provide a website.
- You can input your twitter account website.
- For example: https://twitter.com/yourusername
Create access token.
- Record twitter access keys and tokens

With these steps complete you now have access to the twitter API.

To acquire your own dataset of tweets you can use the saotd::tweet_acquire function and insert your consumer key, consumer secret key, access token and access secret key gained from the Twitter Developers page. You will also need to select the #hashtags you are interested in and the number of tweets requested per #hashtag.

# Placeholder credentials; replace with the values from the Twitter Developers page.
consumer_api_key <- "XXXXXXXXXXXXXXXXXXXXXXXXX"
consumer_api_secret_key <- "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
access_token <- "XXXXXXXXXXXXXXXXXX-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
access_token_secret <- "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"

hashtags <- c("#job", "#Friday", "#fail", "#icecream", "#random", "#kitten", "#airline")

# Sys.getenv() reads each credential from an environment variable of the same name.
tweets <- tweet_acquire(
  twitter_app = "twitter_app",
  consumer_api_key = Sys.getenv('consumer_api_key'),
  consumer_api_secret_key = Sys.getenv('consumer_api_secret_key'),
  access_token = Sys.getenv('access_token'),
  access_token_secret = Sys.getenv('access_token_secret'),
  query = "#icecream",
  num_tweets = 100,
  distinct = TRUE)
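Because the call above reads the credentials with Sys.getenv, the keys need to exist as environment variables (for example in a .Renviron file), not only as R variables. A minimal base R sketch of a pre-flight check follows; the helper name check_twitter_env is hypothetical and is not part of saotd.

```r
# Hypothetical helper (not part of saotd): report which of the expected
# environment variables are unset before calling tweet_acquire().
check_twitter_env <- function(vars = c("consumer_api_key",
                                       "consumer_api_secret_key",
                                       "access_token",
                                       "access_token_secret")) {
  values <- Sys.getenv(vars, unset = "")
  missing <- vars[!nzchar(values)]
  if (length(missing) > 0) {
    warning("Unset environment variables: ", paste(missing, collapse = ", "))
  }
  # Return the names of the missing variables invisibly.
  invisible(missing)
}

# Example: set placeholder values for this session only, then check.
Sys.setenv(consumer_api_key = "XXXX", consumer_api_secret_key = "XXXX")
missing_vars <- suppressWarnings(check_twitter_env())
```

An empty result means all four variables are set; any names returned still need a value before the API call is attempted.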

Explore

You can acquire your own data or use the dataset included with the package. We will be using the included data raw_tweets. This dataset was acquired from a Twitter US Airline Sentiment Kaggle competition, from December 2017. The dataset contains 14,487 tweets from 6 different hashtags (2,604 x #American, 2,220 x #Delta, 2,420 x #Southwest, 3,822 x #United, 2,913 x #US Airways, 504 x #Virgin America).

set.seed(4321)
data("raw_tweets")

TD <- raw_tweets %>%
  dplyr::sample_n(size = 5000, replace = TRUE)

The first tweet of the dataset is: “@SouthwestAir I filled in the form on the website too. Darn it all. I guess I’ll just have to cross my fingers.”, and when it is cleaned and tidied it becomes:

TD_Tidy <-
  saotd::tweet_tidy(DataFrame = TD)

TD_Tidy$Token[1:9] %>%
  knitr::kable("html")

x
southwestair
filled
form
website
darn
guess
ill
cross
fingers

The cleaning process removes “@”, “#” and “RT” symbols, web links, punctuation, emojis, and stop words (such as “the” and “of”).
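To make the cleaning steps concrete, here is a minimal base R sketch that applies them to the example tweet. It is an illustration only and not the implementation of saotd::tweet_tidy; the function clean_tweet and the short stop word list toy_stop_words are hypothetical, and the real stop word list is far longer.

```r
# Hypothetical helper: apply the cleaning steps described above to one tweet.
clean_tweet <- function(text, stop_words) {
  text <- gsub("https?://\\S+", " ", text, perl = TRUE)  # web links
  text <- gsub("\\bRT\\b", " ", text, perl = TRUE)        # retweet marker
  text <- gsub("[@#]", "", text, perl = TRUE)             # mention and hashtag symbols
  text <- gsub("[^A-Za-z0-9\\s]", "", text, perl = TRUE)  # punctuation and emojis
  # Lower-case, split on whitespace, then drop empty tokens and stop words.
  tokens <- strsplit(trimws(tolower(text)), "[[:space:]]+")[[1]]
  tokens[nzchar(tokens) & !(tokens %in% stop_words)]
}

# Toy stop word list, assumed for this example only.
toy_stop_words <- c("i", "in", "the", "on", "too", "it", "all",
                    "just", "have", "to", "my")

tweet <- paste("@SouthwestAir I filled in the form on the website too.",
               "Darn it all. I guess I'll just have to cross my fingers.")
clean_tweet(tweet, toy_stop_words)
```

With this toy stop word list the call returns the same nine tokens shown in the table above, including "ill", which is what remains of "I'll" once the apostrophe is stripped.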

We will now investigate Uni-Grams, Bi-Grams and Tri-Grams.

Twitter data Uni-Grams
word	n
united	1454
flight	1314
usairways	1073
americanair	930
southwestair	860
jetblue	813
cancelled	380
service	319
time	288
im	270

Twitter data Bi-Grams
word1	word2	n
customer	service	198
cancelled	flightled	178
late	flight	85
cancelled	flighted	80
late	flightr	52
cancelled	flight	49
2	hours	40
usairways	americanair	38
3	hours	34
flight	booking	31

Twitter data Tri-Grams
word1	word2	word3	n
NA	NA	NA	54
cancelled	flightled	flight	20
flight	cancelled	flightled	17
worst	customer	service	16
poor	customer	service	10
customer	service	rep	8
hours	late	flightr	8
southwestair	flight	cancelled	8
cancelled	flighted	flight	7
cancelled	flightled	flights	7
flight	cancelled	flighted	7
hours	late	flight	7

Twitter data Uni-Grams
word	n
united	1454
flight	1265
usairways	1073
americanair	930
southwestair	860
jetblue	813
service	319
time	288
im	270
customer	263

Twitter data Bi-Grams
word1	word2	n
customer	service	198
late	flight	85
late	flightr	52
2	hours	40
usairways	americanair	38
3	hours	34
flight	booking	31
gate	agent	29
united	im	26
usairways	flight	23

Twitter data Tri-Grams
word1	word2	word3	n
NA	NA	NA	54
worst	customer	service	16
poor	customer	service	10
customer	service	rep	8
hours	late	flightr	8
hours	late	flight	7
30	min	late	6
cent	latinasciilatinasciilatinascii	cent	6
customer	service	desk	6
jetblue	flight	delayed	6
min	late	flight	6
southwestair	flight	cancelledflightled	6
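The n-gram tables above count how often each run of one, two or three consecutive tokens occurs. A minimal base R sketch of that counting follows, assuming each tweet has already been cleaned into a character vector of tokens; count_ngrams is a hypothetical helper and not a saotd function.

```r
# Hypothetical helper: count n-grams across a list of token vectors.
count_ngrams <- function(token_list, n) {
  grams <- unlist(lapply(token_list, function(tokens) {
    # A tweet with fewer than n tokens contributes no n-grams.
    if (length(tokens) < n) return(character(0))
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts,
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts),
             n = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy input, assumed for illustration: two already-tokenised tweets.
toy_tokens <- list(c("late", "flight", "delayed"),
                   c("late", "flight"))
bigrams <- count_ngrams(toy_tokens, n = 2)
```

Here bigrams has two rows, with "late flight" counted twice and "flight delayed" once. N-grams are built within each tweet, so they never span two tweets.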

Sentiment Calculation

Now that the data has been explored we will need to compute the Sentiment scores for the hashtags.

TD_Scores <-
  saotd::tweet_scores(
    DataFrameTidy = TD_Tidy,
    HT_Topic = "hashtag")
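The printed results later in this document show negative, positive and TweetSentimentScore columns, with the score equal to the positive count minus the negative count (for example 0, 12 and 12). The sketch below illustrates that calculation in base R for a single tweet. The lexicon is a toy stand-in assumed for this example, not the Bing lexicon, and score_tweet is a hypothetical helper rather than the internals of saotd::tweet_scores.

```r
# Toy lexicon, assumed for illustration only.
toy_lexicon <- data.frame(
  word = c("love", "great", "thanks", "worst", "lost", "delayed"),
  sentiment = c("positive", "positive", "positive",
                "negative", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count positive and negative tokens in one tweet.
score_tweet <- function(tokens, lexicon) {
  # Tokens not found in the lexicon become NA and are ignored.
  matched <- lexicon$sentiment[match(tokens, lexicon$word)]
  positive <- sum(matched == "positive", na.rm = TRUE)
  negative <- sum(matched == "negative", na.rm = TRUE)
  c(negative = negative,
    positive = positive,
    TweetSentimentScore = positive - negative)
}

score_tweet(c("worst", "customer", "service", "lost", "bag"), toy_lexicon)
```

For this toy tweet two tokens are negative and none are positive, so the score is -2.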

With the scores computed we can then observe the positive and negative words within the dataset.

saotd::posneg_words(
  DataFrameTidy = TD_Tidy,
  num_words = 10)

## Selecting by n

As an example we can see that the negative term “fail” is dwarfing all other responses. If we would like to remove “fail” we can easily do it.

saotd::posneg_words(
  DataFrameTidy = TD_Tidy,
  num_words = 10,
  filterword = "fail")

## Selecting by n

We can see the most positive tweets by hashtag within the dataset.

saotd::tweet_max_scores(
  DataFrameTidyScores = TD_Scores,
  HT_Topic = "hashtag")

## # A tibble: 6 × 10
##   text    method hashtags created_at key   negative positive TweetSentimentScore
##   <chr>   <chr>  <chr>    <chr>      <chr>    <dbl>    <dbl>               <dbl>
## 1 @Ameri… Bing   American 2015-02-2… polp…        0       12                  12
## 2 @South… Bing   Southwe… 2015-02-1… waln…        0       10                  10
## 3 @South… Bing   Southwe… 2015-02-2… Nico…        0        9                   9
## 4 @South… Bing   Southwe… 2015-02-2… Walt…        0        9                   9
## 5 @unite… Bing   United   2015-02-2… Core…        0        9                   9
## 6 @JetBl… Bing   Delta    2015-02-2… Dres…        0        6                   6
## # ℹ 2 more variables: TweetSentiment <chr>, date <date>

We can also see the most negative tweets by hashtag within the dataset.

saotd::tweet_min_scores(
  DataFrameTidyScores = TD_Scores,
  HT_Topic = "hashtag")

Topic Analysis

If we were interested in conducting a topic analysis on the tweets we would then determine the number of latent topics within the tweet data.

saotd::number_topics(
  DataFrame = TD,
  num_cores = 4L,
  min_clusters = 2,
  max_clusters = 12,
  skip = 1,
  set_seed = 1234)

The number of topics plot shows that between 5 and 7 latent topics reside within the dataset. For this example we could select between 5 and 7 topics to categorize this data. In this case 5 topics will be selected to continue the analysis.

TD_Topics <-
  saotd::tweet_topics(
    DataFrame = TD,
    clusters = 5,
    method = "Gibbs",
    set_seed = 1234,
    num_terms = 10)

In a markdown product the topics table does not print clearly, unlike when it is printed in the console. However, the words associated with each topic can be observed in the table below.

Number	Topic 1	Topic 2	Topic 3	Topic 4	Topic 5
1	united	usairways	americanair	southwestair	flight
2	service	time	usairways	jetblue	cancelled
3	customer	plane	amp	im	hours
4	dont	gate	hold	virginamerica	flights
5	bag	jetblue	call	guys	2
6	check	hour	phone	fly	delayed
7	luggage	waiting	wait	airline	flightled
8	dm	delay	ive	flying	late
9	lost	people	cange	seat	3
10	worst	minutes	day	love	weather

One of the challenges of using a topic model is selecting the correct number of topics. As we can see in the above table, we went from 6 hashtags to 5 different topics.

While this may not be the best example to use, we will continue the topic modeling example. We would first want to rename the topics into something that makes sense. In this case Topic 1 could be luggage, Topic 2 could be gate_delay, Topic 3 could be customer_service, Topic 4 could be enjoy, and Topic 5 could be other_delay. These topics were chosen by observing the words associated with each topic. This selection could be different depending on experience and a deeper understanding of the topics.

We would then want to rename the topics in the dataframe.

TD_Topics <- TD_Topics %>%
  dplyr::mutate(Topic = stringr::str_replace_all(Topic, "^1$", "luggage")) %>%
  dplyr::mutate(Topic = stringr::str_replace_all(Topic, "^2$", "gate_delay")) %>%
  dplyr::mutate(Topic = stringr::str_replace_all(Topic, "^3$", "customer_service")) %>%
  dplyr::mutate(Topic = stringr::str_replace_all(Topic, "^4$", "enjoy")) %>%
  dplyr::mutate(Topic = stringr::str_replace_all(Topic, "^5$", "other_delay"))
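The same renaming can also be expressed as a single lookup, which keeps the number-to-label mapping in one place. The sketch below uses base R and a toy data frame standing in for TD_Topics; it assumes the Topic column holds the topic numbers 1 to 5, as the patterns in the code above imply.

```r
# Mapping from topic number to label, matching the labels used above.
topic_labels <- c("1" = "luggage",
                  "2" = "gate_delay",
                  "3" = "customer_service",
                  "4" = "enjoy",
                  "5" = "other_delay")

# Toy stand-in for TD_Topics, assumed for illustration only.
toy_topics <- data.frame(text = c("tweet a", "tweet b", "tweet c"),
                         Topic = c("3", "1", "5"),
                         stringsAsFactors = FALSE)

# Look each topic number up by name; as.character() also covers numeric topics.
toy_topics$Topic <- unname(topic_labels[as.character(toy_topics$Topic)])
```

A topic number missing from topic_labels would become NA here, which makes an incomplete mapping easy to spot.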

Next we would want to tidy and then score the new topic dataset.

TD_Topics_Tidy <-
  saotd::tweet_tidy(DataFrame = TD_Topics)

TD_Topics_Scores <-
  saotd::tweet_scores(
    DataFrameTidy = TD_Topics_Tidy,
    HT_Topic = "topic")

We can see the most positive topic tweets within the data set.

saotd::tweet_max_scores(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic")

## # A tibble: 6 × 10
##   text       method Topic created_at key   negative positive TweetSentimentScore
##   <chr>      <chr>  <chr> <chr>      <chr>    <dbl>    <dbl>               <dbl>
## 1 @American… Bing   lugg… 2015-02-2… polp…        0       12                  12
## 2 @Southwes… Bing   lugg… 2015-02-1… waln…        0       10                  10
## 3 @Southwes… Bing   lugg… 2015-02-2… Nico…        0        9                   9
## 4 @Southwes… Bing   lugg… 2015-02-2… Walt…        0        9                   9
## 5 @united W… Bing   lugg… 2015-02-2… Core…        0        9                   9
## 6 @JetBlue … Bing   enjoy 2015-02-2… Dres…        0        6                   6
## # ℹ 2 more variables: TweetSentiment <chr>, date <date>

We can also see the most negative tweets by topic within the dataset.

saotd::tweet_min_scores(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic")

## # A tibble: 6 × 10
##   text       method Topic created_at key   negative positive TweetSentimentScore
##   <chr>      <chr>  <chr> <chr>      <chr>    <dbl>    <dbl>               <dbl>
## 1 @JetBlue … Bing   enjoy 2015-02-1… Grac…       10        0                 -10
## 2 @USAirway… Bing   gate… 2015-02-1… thec…        9        0                  -9
## 3 @USAirway… Bing   cust… 2015-02-2… lj_v…        9        0                  -9
## 4 @JetBlue … Bing   enjoy 2015-02-1… Cure…        8        0                  -8
## 5 @Southwes… Bing   cust… 2015-02-2… Dead…        8        0                  -8
## 6 @united y… Bing   cust… 2015-02-2… mace…        8        0                  -8
## # ℹ 2 more variables: TweetSentiment <chr>, date <date>

Furthermore, if we wanted to observe the most positive or negative scores associated with a specific topic we could also do that.

saotd::tweet_max_scores(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic",
  HT_Topic_Selection = "luggage")

## # A tibble: 6 × 10
##   text       method Topic created_at key   negative positive TweetSentimentScore
##   <chr>      <chr>  <chr> <chr>      <chr>    <dbl>    <dbl>               <dbl>
## 1 @American… Bing   lugg… 2015-02-2… polp…        0       12                  12
## 2 @Southwes… Bing   lugg… 2015-02-1… waln…        0       10                  10
## 3 @Southwes… Bing   lugg… 2015-02-2… Nico…        0        9                   9
## 4 @Southwes… Bing   lugg… 2015-02-2… Walt…        0        9                   9
## 5 @united W… Bing   lugg… 2015-02-2… Core…        0        9                   9
## 6 @Southwes… Bing   lugg… 2015-02-2… woaw…        0        6                   6
## # ℹ 2 more variables: TweetSentiment <chr>, date <date>
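Conceptually, the selection shown above amounts to filtering the scored tweets to one topic and keeping the rows with the highest TweetSentimentScore. A minimal base R sketch of that idea follows. The helper top_scores and the toy data are hypothetical, and saotd::tweet_max_scores may handle ties and column selection differently.

```r
# Hypothetical helper: the n highest-scoring rows, optionally for one topic.
top_scores <- function(scores, topic = NULL, n = 6) {
  if (!is.null(topic)) {
    scores <- scores[scores$Topic == topic, , drop = FALSE]
  }
  ordered <- scores[order(scores$TweetSentimentScore, decreasing = TRUE), ,
                    drop = FALSE]
  head(ordered, n)
}

# Toy scored tweets, assumed for illustration only.
toy_scores <- data.frame(
  text = c("t1", "t2", "t3", "t4"),
  Topic = c("luggage", "enjoy", "luggage", "gate_delay"),
  TweetSentimentScore = c(12, 6, -3, -9),
  stringsAsFactors = FALSE
)

top_luggage <- top_scores(toy_scores, topic = "luggage", n = 1)
```

Sorting in increasing order instead would give the counterpart for the most negative tweets.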

Visualizations

Hashtags

Now we will begin visualizing the hashtag data. The distribution of the sentiment scores can be found in the below plot.

saotd::tweet_corpus_distribution(
  DataFrameTidyScores = TD_Scores,
  color = "black",
  fill = "white")

Additionally, if we wanted to see the score distribution for each hashtag, we can find it below.

saotd::tweet_distribution(
  DataFrameTidyScores = TD_Scores,
  HT_Topic = "hashtag",
  bin_width = 1,
  color = "black",
  fill = "white")

We can also observe the hashtag distributions as a Box plot.

saotd::tweet_box(
  DataFrameTidyScores = TD_Scores,
  HT_Topic = "hashtag")

They can also be shown as a Violin plot. The chevrons in each violin plot denote the median of the data and provide a quick reference point to see if a hashtag is generally positive or negative. For example the “random” hashtag has a generally negative sentiment, whereas the “kitten” hashtag has a generally positive sentiment.

saotd::tweet_violin(
  DataFrameTidyScores = TD_Scores,
  HT_Topic = "hashtag")
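The median that each chevron marks can also be computed directly, which gives a numeric check on what the plot suggests. The base R sketch below uses toy data; the column names hashtags and TweetSentimentScore follow the printed output earlier in this document, and the values are invented for illustration.

```r
# Toy scored tweets, assumed for illustration only.
toy_scores <- data.frame(
  hashtags = c("United", "United", "United", "Delta", "Delta", "Delta"),
  TweetSentimentScore = c(-3, -1, 2, 1, 2, 4),
  stringsAsFactors = FALSE
)

# Median score per hashtag, as a named numeric vector.
medians <- sapply(split(toy_scores$TweetSentimentScore, toy_scores$hashtags),
                  median)

# Label each hashtag by the sign of its median.
direction <- ifelse(medians > 0, "positive",
                    ifelse(medians < 0, "negative", "neutral"))
```

In this toy data the median is 2 for Delta and -1 for United, so Delta is labelled positive and United negative.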

One of the more interesting ways to visualize the Twitter data is to observe the change in sentiment over time. This dataset was acquired on a single day and therefore some of the hashtags did not overlap days. However some did and we can see the change in sentiment scores through time.

saotd::tweet_time(
  DataFrameTidyScores = TD_Scores,
  HT_Topic = "hashtag")
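One plausible way to summarise sentiment through time is the average score per hashtag per day. The base R sketch below shows that aggregation on toy data. It is an assumption about a useful summary, not a description of what saotd::tweet_time plots; the date and hashtags column names follow the printed output earlier in this document.

```r
# Toy scored tweets with a date column, assumed for illustration only.
toy_scores <- data.frame(
  date = as.Date(c("2015-02-22", "2015-02-22", "2015-02-23", "2015-02-23")),
  hashtags = c("United", "United", "United", "Delta"),
  TweetSentimentScore = c(-2, 0, 3, 1),
  stringsAsFactors = FALSE
)

# Mean sentiment score for each combination of date and hashtag.
daily <- aggregate(TweetSentimentScore ~ date + hashtags,
                   data = toy_scores,
                   FUN = mean)
```

The result has one row per date and hashtag that actually occurs, so a hashtag seen on a single day contributes a single point and cannot show a trend.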

Finally, if a Twitter user has not disabled georeferencing data, the location of the tweet can be observed. However, in many cases this may not be very insightful because of the lack of data.

Topics

Now we will begin visualizing the topic data. The distribution of the sentiment scores can be found in the below plot.

saotd::tweet_corpus_distribution(
  DataFrameTidyScores = TD_Topics_Scores,
  color = "black",
  fill = "white")

Additionally, if we wanted to see the score distribution for each topic, we can find it below.

saotd::tweet_distribution(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic",
  bin_width = 1,
  color = "black",
  fill = "white")

We can also observe the topic distributions as a Box plot.

saotd::tweet_box(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic")

saotd::tweet_violin(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic")

saotd::tweet_time(
  DataFrameTidyScores = TD_Topics_Scores,
  HT_Topic = "topic")