Twitter Sentiment and Engagement: The Case of the Biden Campaign

Sentiment Analysis
Using several predictive modeling strategies, I investigated how sentiment influences political reach on Twitter and built models to predict the reach of a yet-to-be-posted tweet. This project found that negative political tweets had higher engagement, in terms of both favorites and retweets, than non-negative tweets.
Author

Neil Fasching

Published

October 1, 2021

  • Introduction: 310
  • Hypotheses & Sub-RQs: 523
  • Gathering data: 524
  • Data Exploration & Evaluation: 562
  • Evaluation: 548
  • Limitations and Next Steps: 544
  • Ethical and Normative Considerations: 559

Introduction

As the United States presidential election draws nearer, the Joe Biden campaign has run into a problem with its Twitter strategy. With just a few weeks left before the election, the Biden communications department is sharply divided over how to use his Twitter account in the final stretch of the campaign. Several members of the communications staff believe that, in order to drum up enthusiasm among Biden supporters, his Twitter account should be used for negative campaigning. This would include character attacks and policy attacks against Donald Trump. Other members of Biden’s communications team believe just the opposite: negative campaigning will backfire for Biden. While negativity might help Biden rally some of his supporters, they argue, it will lead Trump to also go negative, which will benefit him more than Biden. Further, they argue that Democrats are different from Republicans and won’t react as favorably to negativity as Trump supporters do. As such, the campaign poses this research question:

RQ: Do tweets from political candidates that contain negative sentiment receive more engagement than tweets from political candidates that are not negative?

Many communication challenges cannot be solved with digital data. However, because the present RQ boils down to how different types of social media posts lead to different levels of online engagement, this problem is one that should be examined through the lens of digital data. Further, this case is relevant from both a theoretical and a societal perspective. There has been ample research into both the negativity bias (Soroka & McAdams, 2015) and negative campaigning (Carraro & Castelli, 2010). This case will add to the research on whether a negativity bias also exists for political tweets and flesh out the efficacy of negative campaigning on Twitter. For society, this research could also affect the campaign style of the presidential race.

Hypotheses

People have a “negativity bias” when it comes to consuming news content, with individuals putting more weight and attention on negative information (Trussler & Soroka, 2014). Negative news, also known as “adverse media,” is news that focuses on unfavorable information and is often defined by its negative tone (Soroka, Fournier, & Nir, 2019). Studies have shown that people pay more attention to negative information than to positive information and are more likely to engage with it (Soroka & McAdams, 2015). As such, it is logical to expect that negative tweets, or tweets with a negative sentiment, are more likely to attract the attention of Twitter users and lead to more engagement. Past research supports this expectation. Oz, Zheng, and Chen (2017) found that negative tweets had higher engagement than non-negative tweets among responses to the White House’s Facebook and Twitter pages. Therefore, based on the argument made by the members of Biden’s communications staff who favor negativity, the first two hypotheses are:

H1a: Negative sentiment in a tweet will be positively associated with the number of retweets the tweet receives.

H1b: Negative sentiment in a tweet will be positively associated with the number of favorites the tweet receives.

The opponents of the negative campaign strategy, however, have a valid point. Trump is a special case: as an avid Twitter user, he often resorts to coarse language, personal attacks, and outright incivility (Ott, 2017). Trump’s followers are not only more accustomed to negative sentiment, they have also shown a strong preference for tweets that include personal attacks (Lee & Xu, 2018). Therefore, while negativity might help Biden, it would help Trump even more. If the campaign becomes more negative on Twitter, that could backfire, leading Trump to be more negative and increasing his Twitter engagement. As such, the second set of hypotheses is:

H2a: The positive effect of negative sentiment on number of retweets will be greater for Trump tweets than for Biden tweets.

H2b: The positive effect of negative sentiment on number of favorites will be greater for Trump tweets than for Biden tweets.

Finally, the opponents of the negative campaign strategy also contend that Republicans are different from Democrats. The extensive work on ideological asymmetries by Jost (2017) backs this up. As people choose an ideology that aligns with their own psychological motivations, people of different ideologies are likely to have psychological differences. For example, research shows that Republicans have a greater need to manage uncertainty and fear, while Democrats are more willing to accept some level of uncertainty in the hope of social progress (Jost et al., 2003). It is possible that Democrats and Republicans also respond differently to negativity. While personal attacks may work well with Republicans, that might not be the case for Democrats. Therefore, the final set of hypotheses is:

H3a: The positive effect of negative sentiment on number of retweets will be greater for Trump and Pence tweets than for Biden and Harris tweets.

H3b: The positive effect of negative sentiment on number of favorites will be greater for Trump and Pence tweets than for Biden and Harris tweets.

Data Collection

As the business challenge involves comparing tweets, the first step in gathering the data was to obtain the most recent tweets of Donald Trump, Joe Biden, Kamala Harris, and Mike Pence. To do this, the last 3,200 tweets from each Twitter account were gathered using Twitter’s API on 10 October 2020. This method was chosen for two reasons. First, as opposed to scraping, which can often miss relevant data, using Twitter’s API lets us be reasonably confident that all of the planned tweets were gathered. Second, from a logistical standpoint, the present study is only concerned with recent tweets that were posted during the election cycle. Twitter’s API only allows the latest 3,200 tweets from a single user to be downloaded. This could be a problem if all of a user’s tweets were required, but since the focus is on the election, the last 3,200 tweets are sufficient.

In addition to obtaining the text of each tweet, the API downloaded some accompanying data, such as time of post, language of post, and whether media was included with the post. Also relevant to this project, the API includes data on the overall engagement with each tweet, namely the number of retweets and the number of favorites.

In order to obtain the sentiment of the tweets, VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis will be run on each tweet, and the negative, positive, neutral, and compound polarity scores will be added to the dataset. VADER was chosen because it was developed for and validated on social media text (Hutto & Gilbert, 2014).

As for privacy, the tweets will be linked to the individual users, which does pose a problem for the privacy of the Twitter users. For example, they may not wish for their tweets to be included in a sentiment analysis. However, as the accounts are used in public campaigns for political office, it seems likely that the other campaigns are also investigating their Twitter data, which mitigates potential privacy concerns. Further, the privacy of the users engaging with the tweets, whether by retweeting or favoriting a post, is protected, as no data is collected on those users.

While the reasoning behind the use of Twitter’s API is sound, this does not mean the data is without potential biases. The first bias could be related to the timing of the tweets. Twitter users tweet at different rates, so the last 3,200 tweets from Trump could represent a much shorter timespan than the last 3,200 tweets from Biden, which could bias the data based on different temporal factors between users. Secondly, there is a clear bias against women and people of color in the dataset. As the dataset contains tweets of three white men and only one woman, the data is skewed towards representing white men. Finally, as only one election at one point in time is being investigated, the generalizability of the data to other elections could be questioned. That said, as the outcome variable is tweet engagement and not something like loan approval, there are no known unwarranted associations between the outcome and protected features such as race and gender.

Above are the needed packages for the project.

Get Tweets

Above is the code to retrieve the last 3,200 tweets by a user. This code was retrieved from the GetLatest3200TweetsFromUser file.
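The original retrieval code is not reproduced here, so the following is only a minimal sketch of the paging pattern that the log below suggests (up to 16 rounds of roughly 200 tweets, with a 15-second pause between rounds). The names `collect_timeline`, `fetch_page`, and `fake_fetch` are hypothetical. In a real run, `fetch_page` would wrap a call to Twitter's v1.1 `statuses/user_timeline` endpoint with `count=200` and a `max_id` cursor; here it is replaced by a stand-in so the sketch runs without credentials.

```python
import time
from typing import Callable, Optional


def collect_timeline(
    fetch_page: Callable[[Optional[int]], list],
    max_rounds: int = 16,
    wait_seconds: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Page backwards through a timeline, newest tweets first."""
    tweets = []
    max_id = None  # None means "start from the newest tweet"
    for _ in range(max_rounds):
        page = fetch_page(max_id)
        if not page:
            break  # nothing older is available
        tweets.extend(page)
        # Next round: only tweets strictly older than the oldest one seen so far.
        max_id = min(t["id"] for t in page) - 1
        sleep(wait_seconds)  # pause between rounds to stay within rate limits
    return tweets


# Stand-in for the real API call (assumption): 450 fake tweets, 200 per page.
FAKE_IDS = list(range(450, 0, -1))


def fake_fetch(max_id, page_size=200):
    eligible = [i for i in FAKE_IDS if max_id is None or i <= max_id]
    return [{"id": i} for i in eligible[:page_size]]


collected = collect_timeline(fake_fetch, wait_seconds=0, sleep=lambda s: None)
print(len(collected))
```

With 16 rounds of at most 200 tweets each, the loop cannot return more than 3,200 tweets per account, which is consistent with the limit described earlier.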

Code to indicate which users’ tweets to collect.

collecting tweets from user:  realDonaldTrump (maximum rounds = 16)
collected 200 tweets from realDonaldTrump in round 1  || waiting for 15 seconds
collected 199 tweets from realDonaldTrump in round 2  || waiting for 15 seconds
collected 200 tweets from realDonaldTrump in round 3  || waiting for 15 seconds
collected 199 tweets from realDonaldTrump in round 4  || waiting for 15 seconds
collected 199 tweets from realDonaldTrump in round 5  || waiting for 15 seconds
collected 199 tweets from realDonaldTrump in round 6  || waiting for 15 seconds
collected 198 tweets from realDonaldTrump in round 7  || waiting for 15 seconds
collected 190 tweets from realDonaldTrump in round 8  || waiting for 15 seconds
collected 199 tweets from realDonaldTrump in round 9  || waiting for 15 seconds
collected 200 tweets from realDonaldTrump in round 10  || waiting for 15 seconds
collected 200 tweets from realDonaldTrump in round 11  || waiting for 15 seconds
collected 198 tweets from realDonaldTrump in round 12  || waiting for 15 seconds
collected 199 tweets from realDonaldTrump in round 13  || waiting for 15 seconds
collected 200 tweets from realDonaldTrump in round 14  || waiting for 15 seconds
collected 200 tweets from realDonaldTrump in round 15  || waiting for 15 seconds
collected 200 tweets from realDonaldTrump in round 16  || waiting for 15 seconds
realDonaldTrump completed
collecting tweets from user:  JoeBiden (maximum rounds = 16)
collected 200 tweets from JoeBiden in round 1  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 2  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 3  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 4  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 5  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 6  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 7  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 8  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 9  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 10  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 11  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 12  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 13  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 14  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 15  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 16  || waiting for 15 seconds
JoeBiden completed
collecting tweets from user:  KamalaHarris (maximum rounds = 16)
collected 200 tweets from KamalaHarris in round 1  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 2  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 3  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 4  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 5  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 6  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 7  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 8  || waiting for 15 seconds
collected 199 tweets from KamalaHarris in round 9  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 10  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 11  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 12  || waiting for 15 seconds
collected 199 tweets from KamalaHarris in round 13  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 14  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 15  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 16  || waiting for 15 seconds
KamalaHarris completed
collecting tweets from user:  Mike_Pence (maximum rounds = 16)
collected 200 tweets from Mike_Pence in round 1  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 2  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 3  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 4  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 5  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 6  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 7  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 8  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 9  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 10  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 11  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 12  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 13  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 14  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 15  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 16  || waiting for 15 seconds
Mike_Pence completed

The loop above retrieves all tweets. The code has been commented out so that the data remains the same if all the code is run again.

Trump data

created_at id id_str full_text truncated display_text_range entities extended_entities source in_reply_to_status_id ... favorite_count favorited retweeted possibly_sensitive lang retweeted_status quoted_status_id quoted_status_id_str quoted_status_permalink quoted_status
0 Sat Oct 10 03:09:32 +0000 2020 1314764977597755392 1314764977597755392 I was honored to receive the first ever Presid... False [0, 191] {'hashtags': [{'text': 'LESM', 'indices': [162... {'media': [{'id': 1314700859079524352, 'id_str... <a href="http://twitter.com/download/iphone" r... nan ... 85771 False False False en NaN nan NaN NaN NaN
1 Sat Oct 10 02:36:30 +0000 2020 1314756664143347712 1314756664143347712 RT @marklevinshow: My interview with the presi... False [0, 129] {'hashtags': [], 'symbols': [], 'user_mentions... NaN <a href="http://twitter.com/download/iphone" r... nan ... 0 False False False en {'created_at': 'Fri Oct 09 23:35:36 +0000 2020... nan NaN NaN NaN
2 Fri Oct 09 23:55:24 +0000 2020 1314716123250778114 1314716123250778114 RT @realDonaldTrump: Will be in Sanford, Flori... False [0, 104] {'hashtags': [], 'symbols': [], 'user_mentions... NaN <a href="http://twitter.com/download/iphone" r... nan ... 0 False False False en {'created_at': 'Fri Oct 09 21:04:39 +0000 2020... nan NaN NaN NaN
3 Fri Oct 09 23:35:09 +0000 2020 1314711027326562306 1314711027326562306 Documents reveal that General Flynn was entrap... False [0, 72] {'hashtags': [], 'symbols': [], 'user_mentions... NaN <a href="http://twitter.com/download/iphone" r... nan ... 140093 False False NaN en NaN nan NaN NaN NaN
4 Fri Oct 09 23:31:20 +0000 2020 1314710067699159041 1314710067699159041 .@SteveScully, the Never Trumper next debate m... False [0, 196] {'hashtags': [], 'symbols': [], 'user_mentions... NaN <a href="http://twitter.com/download/iphone" r... nan ... 121620 False False NaN en NaN nan NaN NaN NaN

5 rows × 31 columns

created_at                      0
id                              0
id_str                          0
full_text                       0
truncated                       0
display_text_range              0
entities                        0
extended_entities            2514
source                          0
in_reply_to_status_id        3074
in_reply_to_status_id_str    3074
in_reply_to_user_id          3071
in_reply_to_user_id_str      3071
in_reply_to_screen_name      3071
user                            0
geo                          3165
coordinates                  3165
place                        3165
contributors                 3165
is_quote_status                 0
retweet_count                   0
favorite_count                  0
favorited                       0
retweeted                       0
possibly_sensitive           1837
lang                            0
retweeted_status             1598
quoted_status_id             2603
quoted_status_id_str         2603
quoted_status_permalink      2603
quoted_status                2830
Trump                           0
Biden                           0
Harris                          0
Pence                           0
Republican                      0
dtype: int64
3165

The Trump dataset is imported. Variables are added to indicate that the tweets are from Trump, who is a Republican. Missing values are then checked for the text of the tweet as well as for the newly created variables. Finally, I check the length of the dataset. The same is then done for Joe Biden, Kamala Harris, and Mike Pence.
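As a rough illustration of this step, the sketch below adds the five indicator columns that appear in the output above (`Trump`, `Biden`, `Harris`, `Pence`, `Republican`) and then counts missing values per column. The helper name `add_account_indicators` and the two-row toy frame are assumptions made for the example; the original import code is not shown.

```python
import pandas as pd


def add_account_indicators(df, account, republican):
    """Return a copy of df with one 0/1 column per account plus a party flag."""
    out = df.copy()
    for name in ["Trump", "Biden", "Harris", "Pence"]:
        out[name] = int(name == account)
    out["Republican"] = int(republican)
    return out


# Toy stand-in for the imported Trump data (assumption).
trump = pd.DataFrame({"full_text": ["first tweet", "second tweet"],
                      "geo": [None, None]})
trump = add_account_indicators(trump, "Trump", republican=True)

print(trump.isnull().sum())  # missing values per column
print(len(trump))            # number of tweets
```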

Biden Data

created_at                      0
id                              0
id_str                          0
full_text                       0
truncated                       0
display_text_range              0
entities                        0
extended_entities            2113
source                          0
in_reply_to_status_id        3085
in_reply_to_status_id_str    3085
in_reply_to_user_id          3085
in_reply_to_user_id_str      3085
in_reply_to_screen_name      3085
user                            0
geo                          3185
coordinates                  3185
place                        3185
contributors                 3185
is_quote_status                 0
retweet_count                   0
favorite_count                  0
favorited                       0
retweeted                       0
possibly_sensitive           1007
lang                            0
quoted_status_id             2733
quoted_status_id_str         2733
quoted_status_permalink      2733
quoted_status                2745
retweeted_status             3031
Trump                           0
Biden                           0
Harris                          0
Pence                           0
Republican                      0
dtype: int64
3185

Harris Data

created_at                      0
id                              0
id_str                          0
full_text                       0
truncated                       0
display_text_range              0
entities                        0
source                          0
in_reply_to_status_id        3129
in_reply_to_status_id_str    3129
in_reply_to_user_id          3129
in_reply_to_user_id_str      3129
in_reply_to_screen_name      3129
user                            0
geo                          3183
coordinates                  3183
place                        3182
contributors                 3183
is_quote_status                 0
retweet_count                   0
favorite_count                  0
favorited                       0
retweeted                       0
lang                            0
possibly_sensitive           1272
retweeted_status             2760
extended_entities            2370
quoted_status_id             2706
quoted_status_id_str         2706
quoted_status_permalink      2706
quoted_status                2729
Trump                           0
Biden                           0
Harris                          0
Pence                           0
Republican                      0
dtype: int64
3183

Pence Data

created_at                      0
id                              0
id_str                          0
full_text                       0
truncated                       0
display_text_range              0
entities                        0
source                          0
in_reply_to_status_id        3124
in_reply_to_status_id_str    3124
in_reply_to_user_id          3124
in_reply_to_user_id_str      3124
in_reply_to_screen_name      3124
user                            0
geo                          3185
coordinates                  3185
place                        3185
contributors                 3185
retweeted_status              928
is_quote_status                 0
retweet_count                   0
favorite_count                  0
favorited                       0
retweeted                       0
lang                            0
possibly_sensitive           1814
extended_entities            2181
quoted_status_id             3066
quoted_status_id_str         3066
quoted_status_permalink      3066
quoted_status                3167
Trump                           0
Biden                           0
Harris                          0
Pence                           0
Republican                      0
dtype: int64

Merge

1.0

Finally, the four datasets are merged. Then a quick check is run to make sure the length of the new dataset is correct.
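One way to do this merge and length check is sketched below with toy frames. The `1.0` shown above is consistent with a ratio check of this kind, but that is an inference; the original code is not shown, and the variable names here are assumptions.

```python
import pandas as pd

# Toy stand-ins for the four per-account datasets (assumption).
trump = pd.DataFrame({"full_text": ["a", "b"], "Trump": 1, "Republican": 1})
biden = pd.DataFrame({"full_text": ["c"], "Trump": 0, "Republican": 0})
harris = pd.DataFrame({"full_text": ["d", "e"], "Trump": 0, "Republican": 0,
                       "place": ["somewhere", None]})  # column only Harris has
pence = pd.DataFrame({"full_text": ["f"], "Trump": 0, "Republican": 1})

frames = [trump, biden, harris, pence]

# Stack the rows; columns missing from a frame are filled with NaN.
merged = pd.concat(frames, ignore_index=True, sort=False)

# Length check: merged rows divided by the sum of the parts should be 1.0.
ratio = len(merged) / sum(len(f) for f in frames)
print(ratio)
```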

Data Cleaning

First, a simple inspection of the data is performed.

full_text retweet_count favorite_count
0 I was honored to receive the first ever Presid... 20884 85771
1 RT @marklevinshow: My interview with the presi... 17307 0
2 RT @realDonaldTrump: Will be in Sanford, Flori... 25471 0
3 Documents reveal that General Flynn was entrap... 41969 140093
4 .@SteveScully, the Never Trumper next debate m... 33220 121620
12718
created_at                       0
id                               0
id_str                           0
full_text                        0
truncated                        0
display_text_range               0
entities                         0
extended_entities             9178
source                           0
in_reply_to_status_id        12412
in_reply_to_status_id_str    12412
in_reply_to_user_id          12409
in_reply_to_user_id_str      12409
in_reply_to_screen_name      12409
user                             0
geo                          12718
coordinates                  12718
place                        12717
contributors                 12718
is_quote_status                  0
retweet_count                    0
favorite_count                   0
favorited                        0
retweeted                        0
possibly_sensitive            5930
lang                             0
retweeted_status              8317
quoted_status_id             11108
quoted_status_id_str         11108
quoted_status_permalink      11108
quoted_status                11471
Trump                            0
Biden                            0
Harris                           0
Pence                            0
Republican                       0
dtype: int64

Drop Retweets

The first task was to drop unwanted observations. For this project, retweets are not of interest, for two reasons. First, the research question and hypotheses are about the negativity of Biden’s tweets, meaning the tweets he writes, not the tweets written by other people. It therefore makes sense to exclude retweets. Second, from a more practical standpoint, retweets themselves are not favorited; only the original tweet can be favorited. As a result, all retweets have a favorite count of zero, which is not an accurate representation of how much people liked or engaged with the retweet. All retweets were therefore dropped from the dataset. To do so, a new variable was created to flag whether each tweet was a retweet, and flagged tweets were dropped.

full_text is_retweet
0 I was honored to receive the first ever Presid... 0
1 RT @marklevinshow: My interview with the presi... 1
2 RT @realDonaldTrump: Will be in Sanford, Flori... 1
3 Documents reveal that General Flynn was entrap... 0
4 .@SteveScully, the Never Trumper next debate m... 0
4401

4,401 of the tweets were retweets.

8317

The new dataset has 8,317 tweets, none of which are retweets.

full_text retweet_count favorite_count is_retweet
0 I was honored to receive the first ever Presid... 20884 85771 0
3 Documents reveal that General Flynn was entrap... 41969 140093 0
4 .@SteveScully, the Never Trumper next debate m... 33220 121620 0
5 Thank you @SenatorDole. So true! https://t.co/... 15147 58881 0
6 https://t.co/UGIAvC7VA3 19078 54239 0

The index of the dataset was then reset.
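A minimal sketch of the flag-and-drop step follows. It assumes the flag is derived from whether `retweeted_status` is present, which matches the counts above (4,401 non-missing `retweeted_status` values out of 12,718), but the original rule is not shown and could equally have been based on the `RT @` prefix. Note also that `drop=True` is a choice made here; the later output shows an `index` column, which suggests the original kept the old index.

```python
import pandas as pd

# Toy stand-in for the merged dataset (assumption).
tweets = pd.DataFrame({
    "full_text": ["Original tweet", "RT @someone: shared tweet", "Another original"],
    "retweeted_status": [None, {"id": 1}, None],
    "favorite_count": [10, 0, 7],
})

# Flag retweets: the API attaches a retweeted_status object only to retweets.
tweets["is_retweet"] = tweets["retweeted_status"].notna().astype(int)
n_retweets = int(tweets["is_retweet"].sum())

# Keep only original tweets and renumber the rows.
originals = tweets[tweets["is_retweet"] == 0].reset_index(drop=True)

print(n_retweets)
print(len(originals))
```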

Check date of Tweets

Next, it was important to ensure that none of the tweets were from before the election cycle, so the created_at variable was converted into a datetime variable.

0    Sat Oct 10 03:09:32 +0000 2020
1    Fri Oct 09 23:35:09 +0000 2020
2    Fri Oct 09 23:31:20 +0000 2020
3    Fri Oct 09 23:01:54 +0000 2020
4    Fri Oct 09 22:30:20 +0000 2020
Name: created_at, dtype: object
0   2020-10-10 03:09:32+00:00
1   2020-10-09 23:35:09+00:00
2   2020-10-09 23:31:20+00:00
3   2020-10-09 23:01:54+00:00
4   2020-10-09 22:30:20+00:00
Name: created_at, dtype: datetime64[ns, UTC]
count                          8317
unique                         8191
top       2020-05-19 22:23:51+00:00
freq                              4
first     2019-08-05 17:58:00+00:00
last      2020-10-10 03:09:32+00:00
Name: created_at, dtype: object

The oldest tweet is from August 5th, 2019. This is after all four had begun campaigning, so no tweets need to be dropped.
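The conversion and date check can be sketched as follows, using two of the timestamps shown above. The explicit `format` string and the `cutoff` date are assumptions added for illustration; the original code and any cutoff it used are not shown.

```python
import pandas as pd

raw = pd.Series(
    ["Sat Oct 10 03:09:32 +0000 2020", "Mon Aug 05 17:58:00 +0000 2019"],
    name="created_at",
)

# Parse Twitter's timestamp layout into timezone-aware (UTC) datetimes.
created = pd.to_datetime(raw, format="%a %b %d %H:%M:%S %z %Y", utc=True)

print(created.min(), created.max())  # oldest and newest tweet

# Hypothetical cutoff: count tweets posted before the start of 2019.
cutoff = pd.Timestamp("2019-01-01", tz="UTC")
n_too_old = int((created < cutoff).sum())
print(n_too_old)
```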

index created_at id id_str full_text truncated display_text_range entities extended_entities source ... quoted_status_id quoted_status_id_str quoted_status_permalink quoted_status Trump Biden Harris Pence Republican is_retweet
7388 199 2019-08-05 17:58:00+00:00 1158437011692429314 1158437011692429314 Gun violence is an epidemic. It impacts our co... False [0, 179] {'hashtags': [], 'symbols': [], 'user_mentions... NaN <a href="https://sproutsocial.com" rel="nofoll... ... 1158211041999970304.000 1158211041999970317 {'url': 'https://t.co/GqZAZurc8D', 'expanded':... {'created_at': 'Mon Aug 05 03:00:05 +0000 2019... 0 0 1 0 0 0

1 rows × 38 columns

index created_at id id_str full_text truncated display_text_range entities extended_entities source ... quoted_status_id quoted_status_id_str quoted_status_permalink quoted_status Trump Biden Harris Pence Republican is_retweet
4628 199 2019-10-26 21:03:00+00:00 1188199370463821824 1188199370463821824 If you work hard, you should be able to share ... False [0, 276] {'hashtags': [], 'symbols': [], 'user_mentions... NaN <a href="https://about.twitter.com/products/tw... ... nan NaN NaN NaN 0 1 0 0 0 0

1 rows × 38 columns

index created_at id id_str full_text truncated display_text_range entities extended_entities source ... quoted_status_id quoted_status_id_str quoted_status_permalink quoted_status Trump Biden Harris Pence Republican is_retweet
1597 185 2020-07-17 16:25:03+00:00 1284162207232733185 1284162207232733185 THANK YOU to the 5 million members of the @NRA... False [0, 284] {'hashtags': [], 'symbols': [], 'user_mentions... NaN <a href="http://twitter.com/download/iphone" r... ... 1283748224243728384.000 1283748224243728384 {'url': 'https://t.co/8ZhChqxgBI', 'expanded':... {'created_at': 'Thu Jul 16 13:00:02 +0000 2020... 1 0 0 0 1 0

1 rows × 38 columns

Add sentiment scores of each tweet

To add the sentiment scores of the tweets, I created a for loop that added the scores to lists that were then added to the dataset.

full_text positive negative neutral compound
0 I was honored to receive the first ever Presid... 0.270 0.000 0.730 0.836
1 Documents reveal that General Flynn was entrap... 0.000 0.000 1.000 0.000
2 .@SteveScully, the Never Trumper next debate m... 0.000 0.173 0.827 -0.742
3 Thank you @SenatorDole. So true! https://t.co/... 0.616 0.000 0.384 0.751
4 https://t.co/UGIAvC7VA3 0.000 0.000 1.000 0.000
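The loop described above might look roughly like the sketch below. The function `add_sentiment_scores` and the helper `make_vader_analyzer` are hypothetical names. The sketch assumes the third-party `vaderSentiment` package, whose `SentimentIntensityAnalyzer.polarity_scores` returns a dictionary with the keys `pos`, `neg`, `neu`, and `compound`. The import is kept inside the helper so the rest of the code can be read and run without the package installed.

```python
import pandas as pd


def add_sentiment_scores(df, analyzer, text_col="full_text"):
    """Score each tweet and attach the four VADER scores as new columns."""
    positive, negative, neutral, compound = [], [], [], []
    for text in df[text_col]:
        scores = analyzer.polarity_scores(text)
        positive.append(scores["pos"])
        negative.append(scores["neg"])
        neutral.append(scores["neu"])
        compound.append(scores["compound"])
    out = df.copy()
    out["positive"] = positive
    out["negative"] = negative
    out["neutral"] = neutral
    out["compound"] = compound
    return out


def make_vader_analyzer():
    """Build a VADER analyzer (requires `pip install vaderSentiment`)."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer()
```

With the package installed, a call such as `add_sentiment_scores(tweets, make_vader_analyzer())` would return the dataset with the `positive`, `negative`, `neutral`, and `compound` columns added.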

Media in tweet

Next, I added a control variable for whether media was included in the tweet. As some tweets have photos or videos while others do not, it is important to control for differences that might affect overall engagement. I did this by adding a variable for whether the ‘extended_entities’ variable mentioned media or not. I used a function provided in the ‘useful functions’ file.

0    {'media': [{'id': 1314700859079524352, 'id_str...
1                                                  NaN
2                                                  NaN
3                                                  NaN
4                                                  NaN
Name: extended_entities, dtype: object
media extended_entities
0 1 {'media': [{'id': 1314700859079524352, 'id_str...
1 0 NaN
2 0 NaN
3 0 NaN
4 0 NaN
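The function from the ‘useful functions’ file is not shown, so the sketch below uses a hypothetical stand-in, `has_media`. It assumes `extended_entities` holds a dictionary when the tweet has attachments and a missing value otherwise; if the column had been stored as strings, it would need to be parsed first.

```python
import pandas as pd


def has_media(extended_entities):
    """Return 1 if the tweet's extended_entities lists any media, else 0."""
    return int(isinstance(extended_entities, dict)
               and bool(extended_entities.get("media")))


# Toy stand-in: one tweet with media, two without (assumption).
tweets = pd.DataFrame({
    "extended_entities": [
        {"media": [{"id": 1314700859079524352}]},
        float("nan"),
        None,
    ]
})
tweets["media"] = tweets["extended_entities"].apply(has_media)
print(tweets["media"].tolist())
```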

Length of Tweet

A control variable for the length of the tweet was also created. Past research has shown that tweets of different lengths have different effects (Han, Gu, & Peng, 2019), so it is important to control for these differences.

0    191
1     72
2    196
3     56
4     23
Name: length, dtype: int64
index                                      int64
created_at                   datetime64[ns, UTC]
id                                         int64
id_str                                    object
full_text                                 object
truncated                                   bool
display_text_range                        object
entities                                  object
extended_entities                         object
source                                    object
in_reply_to_status_id                     object
in_reply_to_status_id_str                 object
in_reply_to_user_id                       object
in_reply_to_user_id_str                   object
in_reply_to_screen_name                   object
user                                      object
geo                                       object
coordinates                               object
place                                     object
contributors                              object
is_quote_status                             bool
retweet_count                              int64
favorite_count                             int64
favorited                                   bool
retweeted                                   bool
possibly_sensitive                        object
lang                                      object
retweeted_status                          object
quoted_status_id                         float64
quoted_status_id_str                      object
quoted_status_permalink                   object
quoted_status                             object
Trump                                      int64
Biden                                      int64
Harris                                     int64
Pence                                      int64
Republican                                 int64
is_retweet                                 int64
positive                                 float64
negative                                 float64
neutral                                  float64
compound                                 float64
media                                      int64
length                                     int64
dtype: object
negative          0
length            0
media             0
retweet_count     0
favorite_count    0
Trump             0
Republican        0
dtype: int64

No missing values in any of the variables of interest.
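A check like the one above can be sketched with pandas as follows. The column names are taken from the output above; the three-row `tweets` frame is made-up illustration data and not the project's dataset.

```python
import pandas as pd

# Hypothetical stand-in for the tweet data; the values are made up.
tweets = pd.DataFrame({
    "negative": [0.000, 0.173, 0.050],
    "length": [118, 196, 260],
    "media": [0, 0, 1],
    "retweet_count": [1044, 3803, 11897],
    "favorite_count": [4746, 17474, 54063],
    "Trump": [0, 1, 0],
    "Republican": [0, 1, 0],
})

variables_of_interest = [
    "negative", "length", "media",
    "retweet_count", "favorite_count", "Trump", "Republican",
]

# Count missing values per column; all zeros means nothing needs imputing.
missing_counts = tweets[variables_of_interest].isna().sum()
print(missing_counts)
```

If any count were above zero, the affected rows would need to be dropped or imputed before fitting the regressions.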

# Data Exploration and Evaluation

To begin the data exploration and evaluation, descriptive tables were produced. A summary of the descriptive statistics can be found at the end of this section.

                   count       mean        std    min       25%        50%        75%          max
retweet_count   8317.000   9504.060  15685.700  0.000  1044.000   3803.000  11897.000   415300.000
favorite_count  8317.000  45996.824  84388.376  0.000  4746.000  17474.000  54063.000  1897125.000
negative        8317.000      0.075      0.100  0.000     0.000      0.037      0.127        0.831
length          8317.000    184.580     83.943  7.000   118.000    199.000    260.000      320.000
media           8317.000      0.340      0.474  0.000     0.000      0.000      1.000        1.000

user                         Biden       Harris       Pence        Trump
retweet_count   count     3031.000     2760.000     928.000     1598.000
                mean      9488.394     4064.164    1925.446    23330.440
                std      16340.594     7964.101    2845.855    19608.835
                min         11.000        2.000      68.000        0.000
                25%       1626.000      696.000     514.750    11194.250
                50%       4719.000     1648.500     940.000    18227.500
                75%      11310.000     4281.500    1971.500    29937.750
                max     327694.000   184872.000   26943.000   415300.000
favorite_count  count     3031.000     2760.000     928.000     1598.000
                mean     50704.301    21228.082    9487.755   101049.254
                std      99206.297    44229.749   13808.130   100050.109
                min         34.000       12.000     259.000        0.000
                25%       7168.500     2955.000    2698.000    44051.250
                50%      20899.000     7642.000    4863.500    73987.000
                75%      53120.500    21452.000    9785.250   125285.250
                max    1897125.000  1001691.000  167461.000  1885859.000
negative        count     3031.000     2760.000     928.000     1598.000
                mean         0.079        0.090       0.027        0.072
                std          0.096        0.100       0.054        0.118
                min          0.000        0.000       0.000        0.000
                25%          0.000        0.000       0.000        0.000
                50%          0.050        0.066       0.000        0.000
                75%          0.133        0.147       0.038        0.117
                max          0.658        0.612       0.363        0.831

Trump                   0         1
negative  count  6719.000  1598.000
          mean      0.076     0.072
          std       0.095     0.118
          min       0.000     0.000
          25%       0.000     0.000
          50%       0.046     0.000
          75%       0.128     0.117
          max       0.658     0.831

Republican                       0            1
retweet_count   count     5791.000     2526.000
                mean      6903.197    15466.689
                std      13315.315    18780.076
                min          2.000        0.000
                25%        940.500     1476.500
                50%       2830.000    10606.500
                75%       7677.000    22462.000
                max     327694.000   415300.000
favorite_count  count     5791.000     2526.000
                mean     36655.887    67411.459
                std      79368.617    91379.912
                min         12.000        0.000
                25%       4134.000     7624.500
                50%      12720.000    42340.000
                75%      36720.000    94136.750
                max    1897125.000  1885859.000
negative        count     5791.000     2526.000
                mean         0.084        0.055
                std          0.098        0.102
                min          0.000        0.000
                25%          0.000        0.000
                50%          0.057        0.000
                75%          0.140        0.081
                max          0.658        0.831
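Tables with this layout can be produced with pandas `describe`, overall and per group. The sketch below assumes a `user` column that holds the account name; the six rows are invented for illustration and are not the project's data.

```python
import pandas as pd

# Hypothetical mini-sample; the values are invented for illustration.
tweets = pd.DataFrame({
    "user": ["Biden", "Biden", "Harris", "Pence", "Trump", "Trump"],
    "retweet_count": [1626, 4719, 696, 940, 11194, 18227],
    "negative": [0.050, 0.133, 0.066, 0.000, 0.000, 0.117],
})

# Overall summary: one row per variable, statistics as columns.
overall = tweets[["retweet_count", "negative"]].describe().T

# Per-user summary: (variable, statistic) as rows, users as columns.
by_user = tweets.groupby("user")[["retweet_count", "negative"]].describe().T

print(overall.round(3))
print(by_user.round(3))
```

Grouping on a 0/1 column such as `Trump` or `Republican` instead of `user` gives the two-column versions of the table.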

Distribution plot of negative sentiment scores.

[Figure: density plot; x-axis: negative, y-axis: Density]

This plot isn’t the most informative, as many of the values are 0. To have a more informative graph, the distribution plot was zoomed in.

[Figure: the same density plot with axis limits (0.0, 10.0)]

Distribution plot of favorite count

[Figure: density plot; x-axis: favorite_count, y-axis: Density]

This plot isn’t the most informative either, because a few very large values stretch the axis and most tweets are compressed near its low end. To have a more informative graph, the distribution plot was zoomed in.

[Figure: the same density plot with axis limits (0.0, 200000.0)]

Distribution plot of the logarithmic transformation of favorite count

[Figure: density plot; x-axis: favorite_count (log-transformed), y-axis: Density]

Distribution plot of retweet count

[Figure: density plot of retweet count with axis limits (0.0, 80000.0)]

Distribution plot of the logarithmic transformation of retweet count

[Figure: density plot; x-axis: retweet_count (log-transformed), y-axis: Density]

Distribution plot of the length of the tweet

[Figure: density plot; x-axis: length, y-axis: Density]

[Figure: count plot; x-axis: media, y-axis: count]

Above is a countplot for whether media was part of the tweet or not.

[Figure: bar plot; x-axis: user, y-axis: favorite_count]

Above is a barplot for average number of favorites per tweet by user.

[Figure: bar plot; x-axis: user, y-axis: retweet_count]

Above is a barplot for average number of retweets per tweet by user.

[Figure: bar plot; x-axis: user, y-axis: negative]

Above is a barplot for average negative sentiment per tweet by user.

[Figure: bar plot; x-axis: Trump, y-axis: negative]

Above is a barplot of the average negative sentiment per tweet between Trump tweets and non-Trump tweets.

[Figure: bar plot; x-axis: Republican, y-axis: negative]

Above is a barplot of the average negative sentiment per tweet between Republican tweets and non-Republican tweets.

[Figure: regression plot; x-axis: negative, y-axis: retweet_count]

Above is a regression plot of negative sentiment against retweet count.

[Figure: regression plot; x-axis: negative, y-axis: retweet_count (log-transformed)]

Above is a regression plot of negative sentiment against the logarithmic transformation of retweet count.

[Figure: regression plot; x-axis: negative, y-axis: favorite_count]

Above is a regression plot of negative sentiment against favorite count.

[Figure: regression plot; x-axis: negative, y-axis: favorite_count (log-transformed)]

Above is a regression plot of negative sentiment against the logarithmic transformation of favorite count.

Summary for stakeholders

The present research project uses two dependent variables, or outcomes, to measure engagement. The first is the number of retweets each tweet has received. A retweet is when the tweet is reposted by another individual. Across these four users, the average number of retweets was 9,504.06 (SD = 15,685.70). Trump had by far the highest average number of retweets (M = 23,330.44; SD = 19,608.84), followed by Biden (M = 9,488.39; SD = 16,340.59), Harris (M = 4,064.16; SD = 7,964.10), and Pence (M = 1,925.45; SD = 2,845.86). The second is the number of favorites each tweet has received. A favorite is when the tweet is liked or ‘favorited’ by another individual. Across these four users, the average number of favorites was 45,996.82 (SD = 84,388.38). Trump again had the highest average number of favorites (M = 101,049.25; SD = 100,050.11), followed by Biden (M = 50,704.30; SD = 99,206.30), Harris (M = 21,228.08; SD = 44,229.75), and Pence (M = 9,487.76; SD = 13,808.13).

As for the sentiment of the tweets, the average tweet was not very negative: the average negative polarity was 0.08 (SD = 0.10), where 0 is neutral and 1 is completely negative. Harris was the most negative (M = 0.09; SD = 0.10), followed closely by Biden (M = 0.08; SD = 0.10) and Trump (M = 0.07; SD = 0.12), with Pence being the least negative (M = 0.03; SD = 0.05).

Turning from individual users to differences by party among the presidential and vice presidential candidates, Republicans on average had more retweets per tweet (M = 15,466.69; SD = 18,780.08) than Democrats (M = 6,903.20; SD = 13,315.32) and more favorites per tweet (M = 67,411.46; SD = 91,379.91) than Democrats (M = 36,655.89; SD = 79,368.62). This is driven mostly by Trump’s popularity, since Pence had the lowest averages of the four users. In terms of negativity, Democrats had a higher average negative polarity score (M = 0.08, SD = 0.10) than Republicans (M = 0.06, SD = 0.10).

The average length of the tweets was 184.58 characters (SD = 83.94), and about a third (34%) of the tweets included some form of media such as a video or photograph.

The distributions for retweet count, favorite count, and negative sentiment are positively skewed because of the high number of values near zero and the large number of positive outliers. As these variables are not normally distributed, this could violate the regression assumption of normality, as it implies that the residuals might also not be normally distributed. This can be checked with a plot of the errors, and if they are not normally distributed, this could be addressed using a log transformation, as shown in the distribution plots. However, for the sake of model interpretability and machine learning predictions, this project will use the original data without transformations (except for the above regression plots). This is a possible drawback, which is discussed in the limitations section below.

Because of this skew, the regression plots with negativity as the IV and retweet count or favorite count as the DV are not very informative. However, when the log is taken of the DVs, there seems to be a slight positive relationship between negative sentiment and engagement, as indicated by the slope of the regression line.
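The effect of a log transformation on skew can be illustrated with a few numbers. The sketch below uses the five-number summary of retweet_count from the descriptive table as a tiny stand-in sample (an assumption for illustration, not the full data) and a hand-written `skewness` helper. Because some tweets have zero retweets, it uses `log1p`, that is log(1 + x), so the transform stays defined at zero; whether the original plots used log(1 + x) or another variant is not stated in the text.

```python
import math

def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# min, 25%, 50%, 75%, max of retweet_count from the descriptive table.
retweets = [0, 1044, 3803, 11897, 415300]

# log1p(x) = log(1 + x) stays defined for tweets with zero retweets.
log_retweets = [math.log1p(v) for v in retweets]

print(round(skewness(retweets), 2), round(skewness(log_retweets), 2))
```

The transformed values are much less skewed in absolute terms, which is why the log-scale regression plots are easier to read.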

Models

Model 1: retweet count without controls

                            OLS Regression Results                            
==============================================================================
Dep. Variable:          retweet_count   R-squared:                       0.218
Model:                            OLS   Adj. R-squared:                  0.217
Method:                 Least Squares   F-statistic:                     578.0
Date:                Sun, 18 Oct 2020   Prob (F-statistic):               0.00
Time:                        18:28:33   Log-Likelihood:                -91127.
No. Observations:                8317   AIC:                         1.823e+05
Df Residuals:                    8312   BIC:                         1.823e+05
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   8879.9043    280.050     31.708      0.000    8330.936    9428.872
negative    7716.0918   1547.219      4.987      0.000    4683.157    1.07e+04
Trump        1.39e+04    429.173     32.384      0.000    1.31e+04    1.47e+04
Pence      -7164.3142    526.749    -13.601      0.000   -8196.874   -6131.755
Harris     -5507.5940    365.513    -15.068      0.000   -6224.091   -4791.097
==============================================================================
Omnibus:                    12425.873   Durbin-Watson:                   1.481
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          9983022.957
Skew:                           8.916   Prob(JB):                         0.00
Kurtosis:                     171.789   Cond. No.                         11.1
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Fit

The first model investigates the main effect of negative sentiment on retweet count. Binary variables are added for Trump, Pence, and Harris, so the reference category is Biden tweets. The R-squared is 0.22, indicating that the model explains 22% of the variance in retweet count.
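A model of this form can be specified with the statsmodels formula interface, which produces summaries in the format shown above. The sketch below is fitted on synthetic data generated inside the block, so its coefficients are not the project's results; the formula string is inferred from the variable names in the summary table and is an assumption about how the original model was written.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in data; none of these values come from the project.
user = rng.choice(["Biden", "Trump", "Pence", "Harris"], size=n)
sim = pd.DataFrame({
    "negative": rng.uniform(0, 0.8, size=n),
    "Trump": (user == "Trump").astype(int),
    "Pence": (user == "Pence").astype(int),
    "Harris": (user == "Harris").astype(int),
})
# Simulated outcome with a known slope of 5000 on `negative`.
sim["retweet_count"] = (
    9000 + 5000 * sim["negative"] + 14000 * sim["Trump"]
    - 7000 * sim["Pence"] - 5500 * sim["Harris"]
    + rng.normal(0, 500, size=n)
)

# No Biden dummy is included, so Biden tweets are the reference category.
model_1 = smf.ols("retweet_count ~ negative + Trump + Pence + Harris", data=sim).fit()
print(model_1.params.round(1))
print(round(model_1.rsquared, 3))
```

Each user coefficient is then read as the difference from a Biden tweet with the same negativity score, and `model_1.summary()` prints the full table.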

Model 2: retweet count with controls

                            OLS Regression Results                            
==============================================================================
Dep. Variable:          retweet_count   R-squared:                       0.235
Model:                            OLS   Adj. R-squared:                  0.234
Method:                 Least Squares   F-statistic:                     425.3
Date:                Sun, 18 Oct 2020   Prob (F-statistic):               0.00
Time:                        18:28:33   Log-Likelihood:                -91034.
No. Observations:                8317   AIC:                         1.821e+05
Df Residuals:                    8310   BIC:                         1.821e+05
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   1.409e+04    495.004     28.463      0.000    1.31e+04    1.51e+04
negative    7869.0982   1557.796      5.051      0.000    4815.429    1.09e+04
Trump       1.233e+04    445.700     27.675      0.000    1.15e+04    1.32e+04
Harris     -5772.8657    362.105    -15.943      0.000   -6482.682   -5063.049
Pence      -6395.8029    525.282    -12.176      0.000   -7425.486   -5366.119
length       -20.1916      1.941    -10.403      0.000     -23.996     -16.387
media      -3507.9851    331.520    -10.582      0.000   -4157.847   -2858.124
==============================================================================
Omnibus:                    12443.464   Durbin-Watson:                   1.486
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         10317471.100
Skew:                           8.928   Prob(JB):                         0.00
Kurtosis:                     174.621   Cond. No.                     2.11e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.11e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Fit

The second model mirrors the first but adds the control variables length and media. The R-squared improved to 0.24, and the adjusted R-squared, which penalizes added predictors, rose from 0.217 to 0.234, so model 2 is preferred over model 1.
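Because R-squared can never decrease when predictors are added, a partial F-test is one way to check that the improvement is more than noise. The sketch below computes it from the figures printed in the two summaries. It is an approximate check, since the R-squared values are rounded to three decimals, and it is not part of the original analysis.

```python
# Values read from the Model 1 and Model 2 summaries above.
r2_reduced = 0.218      # Model 1: no controls
r2_full = 0.235         # Model 2: adds length and media
added_terms = 2         # length, media
df_resid_full = 8310    # Df Residuals of Model 2

# Partial F statistic for the two added controls.
f_stat = ((r2_full - r2_reduced) / added_terms) / ((1 - r2_full) / df_resid_full)
print(round(f_stat, 1))
```

With 2 and 8,310 degrees of freedom the 5% critical value is roughly 3.0, so a statistic of this size supports keeping the controls.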

The positive effect of negative sentiment on retweet count can be visualized above.

Model 3: favorite count without controls

                            OLS Regression Results                            
==============================================================================
Dep. Variable:         favorite_count   R-squared:                       0.132
Model:                            OLS   Adj. R-squared:                  0.132
Method:                 Least Squares   F-statistic:                     317.4
Date:                Sun, 18 Oct 2020   Prob (F-statistic):          1.61e-254
Time:                        18:28:34   Log-Likelihood:            -1.0555e+05
No. Observations:                8317   AIC:                         2.111e+05
Df Residuals:                    8312   BIC:                         2.111e+05
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   5.003e+04   1586.501     31.535      0.000    4.69e+04    5.31e+04
negative    8544.2518   8765.092      0.975      0.330   -8637.514    2.57e+04
Trump       5.041e+04   2431.291     20.733      0.000    4.56e+04    5.52e+04
Harris     -2.957e+04   2070.655    -14.280      0.000   -3.36e+04   -2.55e+04
Pence      -4.078e+04   2984.066    -13.664      0.000   -4.66e+04   -3.49e+04
==============================================================================
Omnibus:                    11758.802   Durbin-Watson:                   1.428
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          5440616.514
Skew:                           8.187   Prob(JB):                         0.00
Kurtosis:                     127.224   Cond. No.                         11.1
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Fit

The third model investigates the main effect of negative sentiment on favorite count. Binary variables are added for Trump, Pence, and Harris, so again the reference category is Biden tweets. The R-squared is 0.13, indicating that the model explains 13% of the variance in favorite count.

Model 4: favorite count with controls

                            OLS Regression Results                            
==============================================================================
Dep. Variable:         favorite_count   R-squared:                       0.170
Model:                            OLS   Adj. R-squared:                  0.170
Method:                 Least Squares   F-statistic:                     284.7
Date:                Sun, 18 Oct 2020   Prob (F-statistic):               0.00
Time:                        18:28:34   Log-Likelihood:            -1.0536e+05
No. Observations:                8317   AIC:                         2.107e+05
Df Residuals:                    8310   BIC:                         2.108e+05
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   9.223e+04   2773.014     33.259      0.000    8.68e+04    9.77e+04
negative    1.115e+04   8726.784      1.278      0.201   -5954.323    2.83e+04
Trump       3.761e+04   2496.816     15.063      0.000    3.27e+04    4.25e+04
Harris     -3.164e+04   2028.517    -15.597      0.000   -3.56e+04   -2.77e+04
Pence      -3.484e+04   2942.633    -11.839      0.000   -4.06e+04   -2.91e+04
length      -166.7139     10.873    -15.333      0.000    -188.027    -145.401
media      -2.691e+04   1857.176    -14.490      0.000   -3.06e+04   -2.33e+04
==============================================================================
Omnibus:                    11746.789   Durbin-Watson:                   1.423
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          5685698.422
Skew:                           8.148   Prob(JB):                         0.00
Kurtosis:                     130.049   Cond. No.                     2.11e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.11e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Fit

The fourth model mirrors the third but adds the control variables length and media. The R-squared improved to 0.17, and the adjusted R-squared rose from 0.132 to 0.170, so model 4 is preferred over model 3.

The estimated effect of negative sentiment on favorite count is positive but not statistically significant (p = 0.201); it can be visualized above.

Model 5: Interaction between negative sentiment and Trump tweets on retweet count

                            OLS Regression Results                            
==============================================================================
Dep. Variable:          retweet_count   R-squared:                       0.237
Model:                            OLS   Adj. R-squared:                  0.236
Method:                 Least Squares   F-statistic:                     368.6
Date:                Sun, 18 Oct 2020   Prob (F-statistic):               0.00
Time:                        18:28:35   Log-Likelihood:                -91023.
No. Observations:                8317   AIC:                         1.821e+05
Df Residuals:                    8309   BIC:                         1.821e+05
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept       1.451e+04    502.684     28.869      0.000    1.35e+04    1.55e+04
negative        3575.2435   1809.547      1.976      0.048      28.079    7122.408
Trump           1.113e+04    515.496     21.585      0.000    1.01e+04    1.21e+04
negative:Trump  1.593e+04   3428.586      4.647      0.000    9212.256    2.27e+04
Harris         -5720.2414    361.835    -15.809      0.000   -6429.528   -5010.955
Pence          -6644.8885    527.363    -12.600      0.000   -7678.652   -5611.125
length           -20.8055      1.943    -10.708      0.000     -24.614     -16.997
media          -3397.9422    331.955    -10.236      0.000   -4048.658   -2747.227
==============================================================================
Omnibus:                    12482.896   Durbin-Watson:                   1.488
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         10530335.745
Skew:                           8.981   Prob(JB):                         0.00
Kurtosis:                     176.391   Cond. No.                     4.84e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.84e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Fit

The fifth model investigates the interaction effect of negative sentiment and Trump tweets on retweet count. Binary variables are added for Trump, Pence, and Harris, so again the reference category is Biden tweets. The R-squared is 0.24, indicating that the model explains 24% of the variance in retweet count.

The greater positive effect of negative sentiment on retweet count for Trump can be visualized above.

Model 6: Interaction between negative sentiment and Trump tweets on favorite count

                            OLS Regression Results                            
==============================================================================
Dep. Variable:         favorite_count   R-squared:                       0.171
Model:                            OLS   Adj. R-squared:                  0.171
Method:                 Least Squares   F-statistic:                     245.6
Date:                Sun, 18 Oct 2020   Prob (F-statistic):               0.00
Time:                        18:28:36   Log-Likelihood:            -1.0536e+05
No. Observations:                8317   AIC:                         2.107e+05
Df Residuals:                    8309   BIC:                         2.108e+05
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept       9.381e+04   2818.073     33.287      0.000    8.83e+04    9.93e+04
negative       -4885.0701   1.01e+04     -0.482      0.630   -2.48e+04     1.5e+04
Trump            3.31e+04   2889.895     11.453      0.000    2.74e+04    3.88e+04
negative:Trump  5.951e+04   1.92e+04      3.096      0.002    2.18e+04    9.72e+04
Harris         -3.144e+04   2028.463    -15.500      0.000   -3.54e+04   -2.75e+04
Pence          -3.577e+04   2956.423    -12.098      0.000   -4.16e+04      -3e+04
length          -169.0068     10.892    -15.516      0.000    -190.358    -147.655
media           -2.65e+04   1860.958    -14.240      0.000   -3.01e+04   -2.29e+04
==============================================================================
Omnibus:                    11768.342   Durbin-Watson:                   1.424
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          5756018.104
Skew:                           8.174   Prob(JB):                         0.00
Kurtosis:                     130.838   Cond. No.                     4.84e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.84e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Fit

The sixth model investigates the interaction effect of negative sentiment and Trump tweets on favorite count. Binary variables are added for Trump, Pence, and Harris, so again the reference category is Biden tweets. The R-squared is 0.17, indicating that the model explains 17% of the variance in favorite count.

The greater positive effect of negative sentiment on favorite count for Trump can be visualized above.

Model 7: Interaction between negative sentiment and Republican tweets on retweet count

                            OLS Regression Results                            
==============================================================================
Dep. Variable:          retweet_count   R-squared:                       0.129
Model:                            OLS   Adj. R-squared:                  0.129
Method:                 Least Squares   F-statistic:                     246.6
Date:                Sun, 18 Oct 2020   Prob (F-statistic):          2.20e-246
Time:                        18:28:37   Log-Likelihood:                -91572.
No. Observations:                8317   AIC:                         1.832e+05
Df Residuals:                    8311   BIC:                         1.832e+05
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept            1.533e+04    488.693     31.365      0.000    1.44e+04    1.63e+04
negative             2267.6330   1974.335      1.149      0.251   -1602.557    6137.823
Republican           5560.0105    428.633     12.971      0.000    4719.783    6400.238
negative:Republican  3.444e+04   3487.129      9.878      0.000    2.76e+04    4.13e+04
length                -35.9548      2.003    -17.955      0.000     -39.880     -32.029
media               -4785.3198    348.785    -13.720      0.000   -5469.026   -4101.613
==============================================================================
Omnibus:                    11548.701   Durbin-Watson:                   1.330
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          6772282.307
Skew:                           7.780   Prob(JB):                         0.00
Kurtosis:                     141.926   Cond. No.                     4.68e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.68e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Fit

The seventh model investigates the interaction effect of negative sentiment and Republican tweets on retweet count. The R-squared is 0.13, indicating that the model explains 13% of the variance in retweet count.

The greater positive effect of negative sentiment on retweet count for Republicans can be visualized above.
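How an interaction coefficient turns into party-specific slopes can be worked out directly from the Model 7 table. The sketch below uses the coefficients as printed (the interaction term is rounded in the summary, so the Republican slope is approximate); the variable names are my own.

```python
# Coefficients read from the Model 7 summary above.
coef_negative = 2267.6330        # slope of `negative` when Republican = 0
coef_interaction = 3.444e4       # negative:Republican, as printed (rounded)

# With an interaction, the slope of `negative` depends on the group:
# slope = coef_negative + coef_interaction * Republican
slope_democrat = coef_negative + coef_interaction * 0
slope_republican = coef_negative + coef_interaction * 1

print(round(slope_democrat, 2), round(slope_republican, 2))
```

These slopes describe the change in predicted retweets as the negative score moves from 0 to 1. Since the highest observed score is 0.831 and the median is below 0.06, realistic changes in negativity correspond to a fraction of these amounts.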

Model 8: Interaction between negative sentiment and Republican tweets on favorite count

                            OLS Regression Results                            
==============================================================================
Dep. Variable:         favorite_count   R-squared:                       0.104
Model:                            OLS   Adj. R-squared:                  0.103
Method:                 Least Squares   F-statistic:                     192.6
Date:                Sun, 18 Oct 2020   Prob (F-statistic):          9.64e-195
Time:                        18:28:38   Log-Likelihood:            -1.0569e+05
No. Observations:                8317   AIC:                         2.114e+05
Df Residuals:                    8311   BIC:                         2.114e+05
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept            9.257e+04   2667.140     34.707      0.000    8.73e+04    9.78e+04
negative            -1.427e+04   1.08e+04     -1.324      0.185   -3.54e+04    6853.173
Republican           1.534e+04   2339.349      6.555      0.000    1.07e+04    1.99e+04
negative:Republican  1.403e+05    1.9e+04      7.374      0.000    1.03e+05    1.78e+05
length               -226.8463     10.929    -20.756      0.000    -248.270    -205.422
media               -3.134e+04   1903.565    -16.462      0.000   -3.51e+04   -2.76e+04
==============================================================================
Omnibus:                    11283.257   Durbin-Watson:                   1.329
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          4524473.945
Skew:                           7.587   Prob(JB):                         0.00
Kurtosis:                     116.251   Cond. No.                     4.68e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.68e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Fit

The eighth model investigates the interaction effect of negative sentiment and Republican tweets on favorite count. The R-squared is 0.10, indicating that the model explains 10% of the variance in favorite count.

The greater positive effect of negative sentiment on favorite count for Republicans can be visualized above.

Machine Learning Models for predictive analytics

Two new variables are created for the interaction terms: one is negative by Trump and the other is negative by Republican.

neg_trump neg_rep
0 0.000 0.000
1 0.000 0.000
2 0.173 0.173
3 0.000 0.000
4 0.000 0.000
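Interaction columns like these are products of the sentiment score and the relevant dummy variable. A minimal sketch, using hypothetical rows (only the 0.173 value echoes the output above):

```python
import pandas as pd

# Hypothetical rows; only the 0.173 value echoes the output above.
tweets = pd.DataFrame({
    "negative": [0.000, 0.000, 0.173, 0.090],
    "Trump": [1, 0, 1, 0],
    "Republican": [1, 0, 1, 0],
})

# Each interaction term is the product of the sentiment score and the dummy.
tweets["neg_trump"] = tweets["negative"] * tweets["Trump"]
tweets["neg_rep"] = tweets["negative"] * tweets["Republican"]

print(tweets[["neg_trump", "neg_rep"]].head())
```

Creating the products explicitly is needed when a model takes a plain feature matrix rather than a formula, which appears to be the case for the `LinearRegression()` models below.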

Predictive Model for Model 5

Because the interaction term was significant in all the models, and because the R-squared slightly increased in the models that included the interaction term, the predictive models were built to mirror the interaction models.

LinearRegression()

Above is the fitted machine learning model for predictive analytics for Model 5.

array([13471.48241824])

A Biden tweet that is not negative, that is 50 characters long, and does not have media is projected to have 13,471 retweets.

array([17046.72594331])

A Biden tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 17,047 retweets. That is an increase of about 3,575 retweets.

array([24598.25359172])

A Trump tweet that is not negative, that is 50 characters long, and does not have media is projected to have 24,598 retweets.

array([44106.63682023])

A Trump tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 44,107 retweets. That is an increase of about 19,508 retweets, a much larger increase than for Biden.

Predictive Model for Model 6

LinearRegression()
array([85354.96269929])

A Biden tweet that is not negative, that is 50 characters long, and does not have media is projected to have 85,355 favorites.

array([80469.89261586])

A Biden tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 80,470 favorites. That is a decrease of 4,885 favorites.

array([118453.34119362])

A Trump tweet that is not negative, that is 50 characters long, and does not have media is projected to have 118,453 favorites.

array([173078.08302607])

A Trump tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 173,078 favorites. That is an increase of 54,625 favorites.

Predictive Model for Model 7

LinearRegression()
array([13530.29502519])

A Biden or Harris tweet that is not negative, that is 50 characters long, and does not have media is projected to have 13,530 retweets.

array([15797.92798289])

A Biden or Harris tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 15,798 retweets.

array([19090.30549902])

A Trump or Pence tweet that is not negative, that is 50 characters long, and does not have media is projected to have 19,090 retweets.

array([55802.67925931])

A Trump or Pence tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 55,803 retweets.

Predictive Model for Model 8

LinearRegression()
array([81227.274972])

A Biden or Harris tweet that is not negative, that is 50 characters long, and does not have media is projected to have 81,227 favorites.

array([66958.12512068])

A Biden or Harris tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 66,958 favorites.

array([96562.36954165])

A Trump or Pence tweet that is not negative, that is 50 characters long, and does not have media is projected to have 96,562 favorites.

array([222631.76330142])

A Trump or Pence tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 222,632 favorites.

Lime for Model 5

[1.73e-01 1.00e+00 1.73e-01 0.00e+00 0.00e+00 1.96e+02 0.00e+00]
Intercept -5121.505631164369
Prediction_local [26287.6598739]
Right: 24935.6080733102

Here we can see how Model 5 came to its prediction for the second tweet in the data. A Trump tweet, without media, with a length of 196 characters, and a negative polarity score of 0.17 has a predicted retweet count of 24,935.61.

Lime for Model 6

[ 0.  1.  0.  0.  0. 56.  0.]
Intercept -26271.41612258483
Prediction_local [116061.35526025]
Right: 117439.30043697983

Here we can see how Model 6 came to its prediction for the third tweet in the data. A Trump tweet, without media, with a length of 56 characters, and a negative polarity score of 0 has a predicted favorite count of 117,439.30.

Lime for Model 7

[ 0.  1.  0. 76.  0.]
Intercept 8697.157851260308
Prediction_local [18527.49277237]
Right: 18155.479650879075

Here we can see how Model 7 came to its prediction for the tenth tweet in the data. A Republican tweet, without media, with a length of 76 characters, and a negative polarity score of 0 has a predicted retweet count of 18,155.48.

Lime for Model 8

[2.14e-01 1.00e+00 2.14e-01 2.65e+02 0.00e+00]
Intercept 26688.734561201883
Prediction_local [64141.42040882]
Right: 74769.27471108577

Here we can see how Model 8 came to its prediction for the thirtieth tweet in the data. A Republican tweet, without media, with a length of 265 characters, and a negative polarity score of 0.21 has a predicted favorite count of 74,769.27.
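In LIME's verbose output, Prediction_local is the value produced by the simple local surrogate that LIME fits around the tweet, while Right is the prediction of the model being explained. The closer the two are, the more faithfully the explanation describes the model at that point. The sketch below compares them for the four outputs above. The helper name relative_gap is illustrative.

```python
# (Prediction_local, Right) pairs copied from the LIME output above.
LIME_OUTPUT = {
    "Model 5": (26287.6598739, 24935.6080733102),
    "Model 6": (116061.35526025, 117439.30043697983),
    "Model 7": (18527.49277237, 18155.479650879075),
    "Model 8": (64141.42040882, 74769.27471108577),
}


def relative_gap(local_prediction, model_prediction):
    """Absolute gap between surrogate and model, as a share of the model's value."""
    return abs(local_prediction - model_prediction) / abs(model_prediction)


gaps = {name: relative_gap(*pair) for name, pair in LIME_OUTPUT.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap:.1%}")
```

For these four tweets the surrogate stays within a few percent of the model for Models 5 to 7 and drifts furthest for Model 8, which suggests reading the Model 8 explanation with somewhat more caution.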

Evaluation

Model 2 (F(6, 8310) = 425.3, p < .001, R² = .24) is the preferred model for investigating the main effect of negativity on retweet count. The coefficient for negativity in this model is positive and statistically significant (β = 7,869.09, p < .001). This provides support for H1a, and the predictions made by the predictive models are consistent with it.

Model 4 (F(6, 8310) = 284.7, p < .001, R² = .17) is the preferred model for investigating the main effect of negativity on favorite count. The coefficient for negativity in this model is positive and statistically significant (β = 11,150, p < .001). This provides support for H1b. Taking into consideration the significance testing models and the predictive models, it becomes clear that negative sentiment in a tweet is associated with greater engagement with that tweet.

For the interaction between negativity and Trump tweets on retweet count, Model 5 (F(7, 8309) = 368.6, p < .001, R² = .24) is utilized. The coefficient for the interaction between negativity and Trump tweets is positive and statistically significant (β = 15,930, p < .001). This provides support for H2a, and the predictive models back this up.

For the interaction between negativity and Trump tweets on favorite count, Model 6 (F(7, 8309) = 245.6, p < .001, R² = .17) is utilized. The coefficient for the interaction between negativity and Trump tweets is positive and statistically significant (β = 59,510, p = .002). This provides support for H2b. Taking into consideration the significance testing models and the predictive models, it becomes clear that the positive effect of negative sentiment on tweet engagement is moderated by whether the tweet was from Trump or not. In other words, negative tweets by Trump were associated with greater engagement than negative tweets by Biden.

Similar results were found for H3a and H3b. Both Model 7 (F(5, 8311) = 246.6, p < .001, R² = .17) and Model 8 (F(5, 8311) = 192.6, p < .001, R² = .10) have a positive coefficient for the interaction between negativity and Republican tweets, on retweet count (β = 34,440, p < .001) and on favorite count (β = 140,300, p < .001) respectively. These results provide solid support for H3a and H3b.
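The interaction coefficients reported above can be cross-checked against the predictions from the predictive models. In a linear model with an interaction term, the coefficient equals a difference in differences: the change in predicted engagement from "not negative" to "completely negative" for Trump (or Republicans), minus the same change for Biden (or Democrats), with length and media held fixed. The sketch below runs that check on the numbers quoted in this report. The dictionary layout and function name are illustrative.

```python
# Predictions quoted in the text, as (not negative, completely negative),
# for the Democratic baseline and for the Trump / Republican comparison group.
PREDICTIONS = {
    "Model 5": {"baseline": (13471.48241824, 17046.72594331),
                "comparison": (24598.25359172, 44106.63682023)},
    "Model 6": {"baseline": (85354.96269929, 80469.89261586),
                "comparison": (118453.34119362, 173078.08302607)},
    "Model 7": {"baseline": (13530.29502519, 15797.92798289),
                "comparison": (19090.30549902, 55802.67925931)},
    "Model 8": {"baseline": (81227.274972, 66958.12512068),
                "comparison": (96562.36954165, 222631.76330142)},
}

# Interaction coefficients as reported in the Evaluation section.
REPORTED_INTERACTION = {
    "Model 5": 15930,
    "Model 6": 59510,
    "Model 7": 34440,
    "Model 8": 140300,
}


def implied_interaction(preds):
    """Difference in differences of the four predictions for one model."""
    baseline_change = preds["baseline"][1] - preds["baseline"][0]
    comparison_change = preds["comparison"][1] - preds["comparison"][0]
    return comparison_change - baseline_change


for name, preds in PREDICTIONS.items():
    implied = implied_interaction(preds)
    print(f"{name}: implied {implied:,.0f}, reported {REPORTED_INTERACTION[name]:,}")
```

The implied values agree with the reported coefficients up to the rounding used in the regression tables, which is the sense in which the predictive models mirror the interaction models.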

It is clear that negative sentiment is positively associated with tweet engagement, both in terms of retweet count and favorite count. On its own, this suggests that Biden should be more negative in his tweets. However, this conclusion becomes muddied once the interaction effects are considered. Though going negative will increase his overall engagement, the effect will be even greater for Trump. If the two candidates get into an increasingly negative war of words, Trump stands to benefit considerably more than Biden. This finding also holds when comparing Trump and Pence to Biden and Harris: negative tweet sentiment is associated with a greater increase in overall Twitter engagement for Republicans relative to Democrats. As such, this project’s recommendation to the organization is not to go completely negative. The occasional negative tweet from Biden can help increase engagement, but by going completely negative, Biden runs the risk of actually helping Trump more than himself. A continuation of the status quo, where Biden is already somewhat negative, is therefore recommended.

Limitations and Next Steps

While this project attempted to be as comprehensive as possible, there are still several limitations both to the data and the analysis that must be considered. The data is limited in two major ways. First, as previously discussed, the data was restricted to the four politicians currently running for executive office of the United States. As a result, the findings and implications cannot be used to generalize about other elections or other politicians. It is possible that the results would be quite different if the tweets were collected for candidates in a Senate or House race, or if a similar project was conducted in a different country. The conclusions should therefore be applied only to the present US Presidential Election and not to any other election, Presidential or otherwise. As the present project was about the present US Presidential Election, however, this limitation is understandable. Future research could increase the generalizability of the results by including politicians from a wide range of elections, both at the local level and the national level.

Second, and more importantly, the data is also limited by Twitter’s API, as it only allows you to obtain the last 3,200 tweets from a user. While all the tweets gathered were posted during the election cycle, they do not represent the same time span. The last 3,200 tweets for Biden date back to October 2019, while the last 3,200 tweets from Trump only date back to July 2020. This is problematic as it could introduce time as a confounding variable. For example, an event in early 2020 that led to many tweets with negative sentiment would be represented in Biden’s data, but not Trump’s. To remedy this, future projects should obtain the tweets a different way in order to include all tweets posted during the election cycle. This could be accomplished by creating or using a Twitter scraper that is capable of scraping all of a user’s tweets. Then all tweets that were not posted during the election cycle can be dropped from the data or excluded by the scraper.
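Whichever collection method is used, the comparison is cleaner when every account covers the same dates. The sketch below shows one way to do that: find the window that all accounts share and drop tweets outside it. The rows are hypothetical (only the account and posting date are kept), and the function names are illustrative. The end date is set to election day, 3 November 2020, as an assumption about where the election cycle would be cut off.

```python
from datetime import date

# Hypothetical tweets as (account, posting date) pairs.
tweets = [
    ("JoeBiden", date(2019, 10, 15)),
    ("JoeBiden", date(2020, 3, 2)),
    ("JoeBiden", date(2020, 8, 20)),
    ("realDonaldTrump", date(2020, 7, 4)),
    ("realDonaldTrump", date(2020, 9, 30)),
]


def common_window_start(rows):
    """Return the latest 'earliest tweet' date across all accounts."""
    earliest = {}
    for account, posted in rows:
        if account not in earliest or posted < earliest[account]:
            earliest[account] = posted
    return max(earliest.values())


def restrict_to_window(rows, start, end):
    """Keep only tweets posted between start and end, inclusive."""
    return [(account, posted) for account, posted in rows if start <= posted <= end]


window_start = common_window_start(tweets)
window_end = date(2020, 11, 3)  # assumed cutoff: election day
filtered = restrict_to_window(tweets, window_start, window_end)
print(window_start, len(filtered))
```

Trimming to the shared window discards data for the account with the longer history, so it trades sample size for comparability. Collecting the full election cycle for every account, as suggested above, avoids that trade-off.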

From an analysis perspective, this project was also limited in two major ways. As previously mentioned, the OLS regression assumption of normality could be violated because several variables are not normally distributed. While this does not bias the statistical estimates, it could undermine the significance tests. This could be addressed by checking the normality of the residuals with tests such as the Shapiro-Wilk or Jarque-Bera test. If the normality assumption is found to be violated, the log of the variables could be taken, or a different model that does not rely on the normality assumption, such as a Generalized Linear Model, could be used. Additionally, future studies should check the other regression assumptions as well, for example the independence of the residuals with the Durbin-Watson test.
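To illustrate why a log transform is a natural candidate here, the sketch below computes the skewness of a set of engagement counts before and after taking logs. The counts are hypothetical, chosen only to have the long right tail typical of retweet data, and skewness is used as a rough diagnostic rather than a formal normality test.

```python
import math


def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical retweet counts with a long right tail.
retweets = [120, 340, 560, 900, 1500, 2400, 5200, 13000, 44000, 222000]

# log1p(x) = log(1 + x), which also handles tweets with zero retweets.
log_retweets = [math.log1p(v) for v in retweets]

print(f"raw skewness: {skewness(retweets):.2f}")
print(f"log skewness: {skewness(log_retweets):.2f}")
```

A value near zero indicates a roughly symmetric distribution. If the outcome is log-transformed, the regression coefficients describe approximate proportional changes in engagement rather than changes in raw counts, so the predictions in this report would need to be reinterpreted accordingly.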

The final limitation is related to the use of the negative polarity scores as the variable for sentiment. There is an alternative variable of the compound sentiment score, which includes positive, neutral, and negative sentiment. As this project was focused on just the negative sentiment, it used the negative polarity scores. Since VADER also gives the compound polarity scores, future projects could investigate if the results hold when all three aspects of sentiment are included.

Ethical and Normative Considerations

All studies should consider the potential ethical and normative issues posed by their work, but this problem becomes even more acute for digital data and machine learning projects, including this one. In many regards, these ethical concerns are mitigated by the design and purpose of this project. For example, this project maintains complete respect for human autonomy, one of the guidelines for ethical AI according to the High-Level Expert Group on Artificial Intelligence (AI HLEG) established by the European Commission, as all the decisions about the tweet are still made by humans. The purpose of this project was to better inform the communications team about the best use of Twitter. It did not decide which tweets were posted or how they were written. The project also has high levels of transparency and explicability, as all the data is public (all tweet posts are public and the VADER sentiment package is open source), and the predictions made by the machine learning analysis can be explained through the use of LIME.

With that said, the present project is not without its ethical considerations. The first, and possibly most important, issue is the possibility to do harm (European Commission, 2019). This project investigated whether being more negative in tweet sentiment led to more engagement for four of the most prominent politicians in the world. While the overall recommendation was not to increase the number of negative tweets, the positive effect of negative sentiment on tweet engagement was documented by this project. As such, it is possible that an organization’s takeaway is that negativity increases engagement, which would lead to increasing levels of negativity in political campaigns. Further, if a similar report were to be conducted by the Trump campaign, it would certainly recommend increasing negativity in tweets, as doing so benefits Trump more than Biden. In either case, this project or one similar to it could lead to increasing negativity in politics, which could be quite harmful both at the individual level and the societal level. Therefore, the overall harm caused by the recommendations must be taken into account.

A second concern is that this project could lead to the manipulation of consumers, or in this case, citizens (Finn & Wadhwa, 2014). Projects making recommendations about the ideal tweet sentiment could lead politicians to make insincere comments or state flat-out falsehoods. For example, politicians might post very negative tweets about a topic to rally their base, even if they do not care about the issue. Further, it may lead them to negatively spin different events and policies just for the sake of increasing engagement. This would be a manipulation of the people. Instead of stating their true intentions and beliefs, politicians could be led by projects such as these to mislead or lie in order to be more popular. Therefore, any recommendations made by such projects should be clear that changes in sentiment should not be conflated with changes in issues or opinions.

Finally, from a normative perspective, this project could have an effect on the Presidential Election, which could be problematic. This raises the question of to what degree society is comfortable with AI helping to make decisions that could have profound effects on Presidential Elections. Further, as just mentioned, projects like these could lead to an increase in negativity in society, at least on political social media. Increasing negativity is potentially harmful to society and is something society must consider when projects such as these are employed.

References

  • Carraro, L., & Castelli, L. (2010). The Implicit and Explicit Effects of Negative Political Campaigns: Is the Source Really Blamed? Political Psychology, 31(4), 617-645. doi:10.1111/j.1467-9221.2010.00771.x

  • European Commission. (2019). Ethics guidelines for trustworthy AI. High-Level Expert Group on Artificial Intelligence. https://ec.europa.eu/futurium/en/ai-alliance-consultation

  • Finn, R. L., & Wadhwa, K. (2014). The ethics of “smart” advertising and regulatory initiatives in the consumer intelligence industry. Info, 16(3), 22-39. doi:10.1108/info-12-2013-0059

  • Han, X., Gu, X., & Peng, S. (2019). Analysis of Tweet Form’s effect on users’ engagement on Twitter. Cogent Business & Management, 6(1). doi:10.1080/23311975.2018.1564168

  • Hutto, C. J., & Gilbert, E. (2014, June). VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International Conference on Weblogs and Social Media (ICWSM-14).

  • Jost, J. T. (2017). Ideological Asymmetries and the Essence of Political Psychology. Political Psychology, 38(2), 167-208. doi:10.1111/pops.12407

  • Jost, J. T., Glaser, J., Kruglanski, A. W., & Sulloway, F. J. (2003). Political conservatism as motivated social cognition. Psychological Bulletin, 129(3), 339-375. doi:10.1037/0033-2909.129.3.339

  • Lee, J., & Xu, W. (2018). The more attacks, the more retweets: Trump’s and Clinton’s agenda setting on Twitter. Public Relations Review, 44(2), 201-213. doi:10.1016/j.pubrev.2017.10.002

  • Ott, B. L. (2016). The age of Twitter: Donald J. Trump and the politics of debasement. Critical Studies in Media Communication, 34(1), 59-68. doi:10.1080/15295036.2016.1266686

  • Oz, M., Zheng, P., & Chen, G. M. (2017). Twitter versus Facebook: Comparing incivility, impoliteness, and deliberative attributes. New Media & Society, 20(9), 3400-3419. doi:10.1177/1461444817749516

  • Soroka, S., & McAdams, S. (2015). News, Politics, and Negativity. Political Communication, 32(1), 1-22. doi:10.1080/10584609.2014.881942

  • Soroka, S., Fournier, P., & Nir, L. (2019). Cross-national evidence of a negativity bias in psychophysiological reactions to news. Proceedings of the National Academy of Sciences, 116(38), 18888-18892. doi:10.1073/pnas.1908369116

  • Trussler, M., & Soroka, S. (2014). Consumer Demand for Cynical and Negative News Frames. The International Journal of Press/Politics, 19(3), 360-379. doi:10.1177/1940161214524832