Twitter Sentiment and Engagement: The Case of the Biden Campaign

Sentiment Analysis
Using several predictive modeling strategies, I investigated how sentiment influences political reach on Twitter and built models to predict the reach of a yet-to-be-tweeted tweet. The project found that negative political tweets received more engagement, in both favorites and retweets, than non-negative tweets.
Author

Neil Fasching

Published

October 1, 2021

  • Introduction: 310
  • Hypotheses & Sub-RQs: 523
  • Gathering data: 524
  • Data Exploration & Evaluation: 562
  • Evaluation: 548
  • Limitations and Next Steps: 544
  • Ethical and Normative Considerations: 559

Introduction

As the presidential election in the United States draws nearer, the Joe Biden campaign has run into a problem with its Twitter strategy. With just a few weeks left before the election, the Biden communications team is sharply divided over how to use his Twitter account in the final stretch of the campaign. Several members of the team believe that, in order to drum up enthusiasm among Biden supporters, his Twitter account should be used for negative campaigning. This would include character attacks and policy attacks against Donald Trump. Other members of the team believe just the opposite: negative campaigning will backfire for Biden. While negativity might help Biden rally some of his supporters, they argue, it will lead Trump to go negative as well, which will benefit him more than Biden. Further, they argue that Democrats are different from Republicans and won’t react as favorably to negativity as Trump supporters do. As such, the campaign poses the following research question:

RQ: Do tweets from political candidates that contain negative sentiment receive more engagement than tweets from political candidates that are not negative?

Many communication challenges cannot be solved with digital data. However, because the present RQ boils down to how different types of social media posts lead to different levels of online engagement, this problem is one that should be examined through the lens of digital data. Further, the case is relevant from both a theoretical and a societal perspective. There has been ample research into both the negativity bias (Soroka & McAdams, 2015) and negative campaigning (Carraro & Castelli, 2010). This case will add to the research on whether a negativity bias also exists for political tweets and will flesh out the efficacy of negative campaigning on Twitter. For society, this research could also affect the campaign style of the presidential race.

Hypotheses

People have a “negativity bias” when it comes to consuming news content, with individuals putting more weight and attention on negative information (Trussler & Soroka, 2014). Negative news, also known as “adverse media,” is news that focuses on unfavorable information and is often defined by its negative tone (Soroka, Fournier, & Nir, 2019). Studies have shown that people pay more attention to negative information than to positive information and are more likely to engage with it (Soroka & McAdams, 2015). As such, it is logical to think that negative tweets, or tweets with a negative sentiment, are more likely to attract the attention of Twitter users and lead to more engagement. Past research supports this. Oz, Zheng, and Chen (2017) found that negative tweets had higher engagement than non-negative tweets among responses to the White House’s Facebook and Twitter pages. Therefore, based on the argument made by the members of Biden’s communication staff who favor negativity, the first two hypotheses are:

H1a: Negative sentiment in a tweet will be positively associated with the number of retweets the tweet receives.

H1b: Negative sentiment in a tweet will be positively associated with the number of favorites the tweet receives.

The opponents of the negative campaign strategy, however, have a valid point. Trump is a special case: an avid Twitter user who often resorts to coarse language, personal attacks, and outright incivility (Ott, 2017). Trump’s followers are not only more accustomed to negative sentiment, they have actually shown a strong preference for tweets that include personal attacks (Lee & Xu, 2018). Therefore, while negativity might help Biden, it would help Trump even more. If the campaign becomes more negative on Twitter, that could backfire, leading Trump to be more negative and increasing his Twitter engagement. As such, the second set of hypotheses is:

H2a: The positive effect of negative sentiment on number of retweets will be greater for Trump tweets than for Biden tweets.

H2b: The positive effect of negative sentiment on number of favorites will be greater for Trump tweets than for Biden tweets.

Finally, the opponents of the negative campaign strategy also contend that Republicans are different from Democrats. The extensive work on ideological asymmetries by Jost (2017) backs this up. As people choose an ideology that aligns with their own psychological motivations, people of different ideologies are likely to have psychological differences. For example, research shows that Republicans have a greater need to manage uncertainty and fear, while Democrats are more willing to accept some level of uncertainty in the hopes of social progress (Jost et al., 2003). It is possible that Democrats and Republicans also respond differently to negativity. While personal attacks may work well with Republicans, that might not be the case for Democrats. Therefore, the final set of hypotheses is:

H3a: The positive effect of negative sentiment on number of retweets will be greater for Trump and Pence tweets than for Biden and Harris tweets.

H3b: The positive effect of negative sentiment on number of favorites will be greater for Trump and Pence tweets than for Biden and Harris tweets.

Data Collection

As the business challenge involves comparing tweets, the first step in gathering the data was to obtain the recent tweets of Donald Trump, Joe Biden, Kamala Harris, and Mike Pence. To do this, the last 3,200 tweets from each account were gathered using Twitter’s API on 10 October 2020. This method was chosen for two reasons. First, unlike scraping, which can often miss relevant data, Twitter’s API gives reasonable confidence that all of the planned tweets were gathered. Second, from a logistical standpoint, the present study is only concerned with recent tweets posted during the election cycle. Twitter’s API only allows the latest 3,200 tweets from a single user to be downloaded. This could be a problem if all of a user’s tweets were required, but since the focus is on the election, the last 3,200 tweets are sufficient.

In addition to the text of each tweet, the API returned some accompanying data, such as the time of the post, the language of the post, and whether media was included with the post. Also relevant to this project, the API includes data on the overall engagement with each tweet, namely the number of retweets and the number of favorites.

In order to obtain the sentiment of the tweets, VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analysis will be run on each tweet, and the negative, positive, neutral, and compound polarity scores will be added to the dataset. VADER was chosen as it is quite good at analyzing social media posts (Hutto & Gilbert, 2014).

As for privacy, the tweets will be linked to the individual users, which does pose a problem for the privacy of the Twitter user. For example, they may not wish for their tweets to be included in a sentiment analysis. However, as the accounts are used in public campaigns for political office, it would seem likely that the other campaigns are also investigating their Twitter data, which mitigates potential privacy concerns. Further, the privacy of the users engaging with the tweets, whether by retweeting or favoriting a post, is protected as no data is collected on those users.

While the reasoning behind the use of Twitter’s API is sound, this does not mean the data is without potential biases. The first bias could be related to the timing of the tweets. Twitter users tweet at different rates, so the last 3,200 tweets from Trump could represent a much shorter timespan than the last 3,200 tweets from Biden, and therefore could bias the data based on different temporal factors between users. Secondly, there is a clear bias against women and people of color in the dataset. As the dataset contains tweets of three white men and only one woman, the data is skewed towards representing white men. And finally, as only one election at one time is being investigated, the generalizability of the data to other elections could be questioned. That said, as the outcome variable is tweet engagement and not something like loan approval, there are no known unwarranted associations between the outcome and protected features such as race and gender.

Above are the needed packages for the project.

Get Tweets

Above is the code to retrieve the last 3,200 tweets by a user. This code was retrieved from the GetLatest3200TweetsFromUser file.
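The retrieval cell itself is not reproduced in this export. As a rough sketch of the paging logic such a helper typically follows, the function below walks backwards through a timeline in pages of up to 200 tweets, using the ID of the oldest tweet seen so far as the upper bound for the next request (16 rounds of 200 gives the 3,200-tweet ceiling). The names `collect_timeline`, `fetch_page`, and `fake_fetch` are hypothetical; `fetch_page` stands in for whatever API call the original file makes, and nothing here should be read as the author's actual code.

```python
import time
from typing import Any, Callable, Dict, List, Optional

Tweet = Dict[str, Any]


def collect_timeline(
    screen_name: str,
    fetch_page: Callable[[str, int, Optional[int]], List[Tweet]],
    max_rounds: int = 16,
    page_size: int = 200,
    wait_seconds: float = 15.0,
) -> List[Tweet]:
    """Page backwards through a user's timeline, newest tweets first."""
    tweets: List[Tweet] = []
    max_id: Optional[int] = None  # None means "start from the newest tweet"
    for round_no in range(1, max_rounds + 1):
        page = fetch_page(screen_name, page_size, max_id)
        if not page:
            break  # nothing older is available
        tweets.extend(page)
        # Next round: only tweets strictly older than the oldest one seen so far.
        max_id = min(int(t["id"]) for t in page) - 1
        print(f"collected {len(page)} tweets from {screen_name} in round {round_no}")
        if round_no < max_rounds:
            time.sleep(wait_seconds)  # pause between requests
    return tweets


def fake_fetch(screen_name: str, count: int, max_id: Optional[int]) -> List[Tweet]:
    """Hypothetical stand-in for a real API call: serves ids 1000 down to 1."""
    newest = 1000 if max_id is None else min(max_id, 1000)
    return [{"id": i} for i in range(newest, max(newest - count, 0), -1)]


demo = collect_timeline("example_user", fake_fetch, max_rounds=3, wait_seconds=0)
```

Passing the fetcher in as an argument keeps the paging logic testable without network access; a real run would supply a function that wraps the API client.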

This code specifies the users whose tweets are collected.

collecting tweets from user:  realDonaldTrump (maximum rounds = 16)
collected 200 tweets from realDonaldTrump in round 1  || waiting for 15 seconds
collected 199 tweets from realDonaldTrump in round 2  || waiting for 15 seconds
collected 200 tweets from realDonaldTrump in round 3  || waiting for 15 seconds
collected 199 tweets from realDonaldTrump in round 4  || waiting for 15 seconds
collected 199 tweets from realDonaldTrump in round 5  || waiting for 15 seconds
collected 199 tweets from realDonaldTrump in round 6  || waiting for 15 seconds
collected 198 tweets from realDonaldTrump in round 7  || waiting for 15 seconds
collected 190 tweets from realDonaldTrump in round 8  || waiting for 15 seconds
collected 199 tweets from realDonaldTrump in round 9  || waiting for 15 seconds
collected 200 tweets from realDonaldTrump in round 10  || waiting for 15 seconds
collected 200 tweets from realDonaldTrump in round 11  || waiting for 15 seconds
collected 198 tweets from realDonaldTrump in round 12  || waiting for 15 seconds
collected 199 tweets from realDonaldTrump in round 13  || waiting for 15 seconds
collected 200 tweets from realDonaldTrump in round 14  || waiting for 15 seconds
collected 200 tweets from realDonaldTrump in round 15  || waiting for 15 seconds
collected 200 tweets from realDonaldTrump in round 16  || waiting for 15 seconds
realDonaldTrump completed
collecting tweets from user:  JoeBiden (maximum rounds = 16)
collected 200 tweets from JoeBiden in round 1  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 2  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 3  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 4  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 5  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 6  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 7  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 8  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 9  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 10  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 11  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 12  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 13  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 14  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 15  || waiting for 15 seconds
collected 200 tweets from JoeBiden in round 16  || waiting for 15 seconds
JoeBiden completed
collecting tweets from user:  KamalaHarris (maximum rounds = 16)
collected 200 tweets from KamalaHarris in round 1  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 2  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 3  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 4  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 5  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 6  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 7  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 8  || waiting for 15 seconds
collected 199 tweets from KamalaHarris in round 9  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 10  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 11  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 12  || waiting for 15 seconds
collected 199 tweets from KamalaHarris in round 13  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 14  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 15  || waiting for 15 seconds
collected 200 tweets from KamalaHarris in round 16  || waiting for 15 seconds
KamalaHarris completed
collecting tweets from user:  Mike_Pence (maximum rounds = 16)
collected 200 tweets from Mike_Pence in round 1  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 2  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 3  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 4  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 5  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 6  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 7  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 8  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 9  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 10  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 11  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 12  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 13  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 14  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 15  || waiting for 15 seconds
collected 200 tweets from Mike_Pence in round 16  || waiting for 15 seconds
Mike_Pence completed

The loop above retrieves all tweets. The code has been commented out so that the data remains the same if all the code is run again.

Trump data

created_at id id_str full_text truncated display_text_range entities extended_entities source in_reply_to_status_id ... favorite_count favorited retweeted possibly_sensitive lang retweeted_status quoted_status_id quoted_status_id_str quoted_status_permalink quoted_status
0 Sat Oct 10 03:09:32 +0000 2020 1314764977597755392 1314764977597755392 I was honored to receive the first ever Presid... False [0, 191] {'hashtags': [{'text': 'LESM', 'indices': [162... {'media': [{'id': 1314700859079524352, 'id_str... <a href="http://twitter.com/download/iphone" r... nan ... 85771 False False False en NaN nan NaN NaN NaN
1 Sat Oct 10 02:36:30 +0000 2020 1314756664143347712 1314756664143347712 RT @marklevinshow: My interview with the presi... False [0, 129] {'hashtags': [], 'symbols': [], 'user_mentions... NaN <a href="http://twitter.com/download/iphone" r... nan ... 0 False False False en {'created_at': 'Fri Oct 09 23:35:36 +0000 2020... nan NaN NaN NaN
2 Fri Oct 09 23:55:24 +0000 2020 1314716123250778114 1314716123250778114 RT @realDonaldTrump: Will be in Sanford, Flori... False [0, 104] {'hashtags': [], 'symbols': [], 'user_mentions... NaN <a href="http://twitter.com/download/iphone" r... nan ... 0 False False False en {'created_at': 'Fri Oct 09 21:04:39 +0000 2020... nan NaN NaN NaN
3 Fri Oct 09 23:35:09 +0000 2020 1314711027326562306 1314711027326562306 Documents reveal that General Flynn was entrap... False [0, 72] {'hashtags': [], 'symbols': [], 'user_mentions... NaN <a href="http://twitter.com/download/iphone" r... nan ... 140093 False False NaN en NaN nan NaN NaN NaN
4 Fri Oct 09 23:31:20 +0000 2020 1314710067699159041 1314710067699159041 .@SteveScully, the Never Trumper next debate m... False [0, 196] {'hashtags': [], 'symbols': [], 'user_mentions... NaN <a href="http://twitter.com/download/iphone" r... nan ... 121620 False False NaN en NaN nan NaN NaN NaN

5 rows × 31 columns

created_at                      0
id                              0
id_str                          0
full_text                       0
truncated                       0
display_text_range              0
entities                        0
extended_entities            2514
source                          0
in_reply_to_status_id        3074
in_reply_to_status_id_str    3074
in_reply_to_user_id          3071
in_reply_to_user_id_str      3071
in_reply_to_screen_name      3071
user                            0
geo                          3165
coordinates                  3165
place                        3165
contributors                 3165
is_quote_status                 0
retweet_count                   0
favorite_count                  0
favorited                       0
retweeted                       0
possibly_sensitive           1837
lang                            0
retweeted_status             1598
quoted_status_id             2603
quoted_status_id_str         2603
quoted_status_permalink      2603
quoted_status                2830
Trump                           0
Biden                           0
Harris                          0
Pence                           0
Republican                      0
dtype: int64
3165

The Trump dataset is imported. Variables are added to indicate that the tweets are from Trump, who is a Republican. Missing values are then checked for the text of the tweet as well as the newly created variables. Finally, I check the length of the dataset. The same is then done for Joe Biden, Kamala Harris, and Mike Pence.

Biden Data

created_at                      0
id                              0
id_str                          0
full_text                       0
truncated                       0
display_text_range              0
entities                        0
extended_entities            2113
source                          0
in_reply_to_status_id        3085
in_reply_to_status_id_str    3085
in_reply_to_user_id          3085
in_reply_to_user_id_str      3085
in_reply_to_screen_name      3085
user                            0
geo                          3185
coordinates                  3185
place                        3185
contributors                 3185
is_quote_status                 0
retweet_count                   0
favorite_count                  0
favorited                       0
retweeted                       0
possibly_sensitive           1007
lang                            0
quoted_status_id             2733
quoted_status_id_str         2733
quoted_status_permalink      2733
quoted_status                2745
retweeted_status             3031
Trump                           0
Biden                           0
Harris                          0
Pence                           0
Republican                      0
dtype: int64
3185

Harris Data

created_at                      0
id                              0
id_str                          0
full_text                       0
truncated                       0
display_text_range              0
entities                        0
source                          0
in_reply_to_status_id        3129
in_reply_to_status_id_str    3129
in_reply_to_user_id          3129
in_reply_to_user_id_str      3129
in_reply_to_screen_name      3129
user                            0
geo                          3183
coordinates                  3183
place                        3182
contributors                 3183
is_quote_status                 0
retweet_count                   0
favorite_count                  0
favorited                       0
retweeted                       0
lang                            0
possibly_sensitive           1272
retweeted_status             2760
extended_entities            2370
quoted_status_id             2706
quoted_status_id_str         2706
quoted_status_permalink      2706
quoted_status                2729
Trump                           0
Biden                           0
Harris                          0
Pence                           0
Republican                      0
dtype: int64
3183

Pence Data

created_at                      0
id                              0
id_str                          0
full_text                       0
truncated                       0
display_text_range              0
entities                        0
source                          0
in_reply_to_status_id        3124
in_reply_to_status_id_str    3124
in_reply_to_user_id          3124
in_reply_to_user_id_str      3124
in_reply_to_screen_name      3124
user                            0
geo                          3185
coordinates                  3185
place                        3185
contributors                 3185
retweeted_status              928
is_quote_status                 0
retweet_count                   0
favorite_count                  0
favorited                       0
retweeted                       0
lang                            0
possibly_sensitive           1814
extended_entities            2181
quoted_status_id             3066
quoted_status_id_str         3066
quoted_status_permalink      3066
quoted_status                3167
Trump                           0
Biden                           0
Harris                          0
Pence                           0
Republican                      0
dtype: int64

Merge

1.0

Finally, the four datasets are merged. Then a quick check is run to make sure the length of the new dataset is correct.
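The merge cell is not shown in this export. A minimal sketch of the step, assuming the per-account data are pandas DataFrames, is given below: each frame gets the indicator columns that appear in the outputs above (`Trump`, `Biden`, `Harris`, `Pence`, `Republican`), the frames are stacked, and the merged length is compared with the sum of the parts. The helper `tag_account` and the toy data are hypothetical, and reading the printed 1.0 as this kind of ratio check is an assumption.

```python
import pandas as pd

ACCOUNTS = ["Trump", "Biden", "Harris", "Pence"]


def tag_account(df: pd.DataFrame, account: str, republican: bool) -> pd.DataFrame:
    """Add one 0/1 indicator per account plus a party indicator."""
    out = df.copy()
    for name in ACCOUNTS:
        out[name] = int(name == account)
    out["Republican"] = int(republican)
    return out


# Toy stand-ins for the four downloaded datasets (hypothetical data).
raw = {
    "Trump": (pd.DataFrame({"full_text": ["t1", "t2"]}), True),
    "Biden": (pd.DataFrame({"full_text": ["b1", "b2", "b3"]}), False),
    "Harris": (pd.DataFrame({"full_text": ["h1"]}), False),
    "Pence": (pd.DataFrame({"full_text": ["p1", "p2"]}), True),
}
frames = [tag_account(df, name, rep) for name, (df, rep) in raw.items()]

# Stack the frames and confirm no rows were lost or duplicated.
merged = pd.concat(frames, ignore_index=True)
ratio = len(merged) / sum(len(f) for f in frames)
print(ratio)
```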

Data Cleaning

First, a simple inspection of the data is performed.

full_text retweet_count favorite_count
0 I was honored to receive the first ever Presid... 20884 85771
1 RT @marklevinshow: My interview with the presi... 17307 0
2 RT @realDonaldTrump: Will be in Sanford, Flori... 25471 0
3 Documents reveal that General Flynn was entrap... 41969 140093
4 .@SteveScully, the Never Trumper next debate m... 33220 121620
12718
created_at                       0
id                               0
id_str                           0
full_text                        0
truncated                        0
display_text_range               0
entities                         0
extended_entities             9178
source                           0
in_reply_to_status_id        12412
in_reply_to_status_id_str    12412
in_reply_to_user_id          12409
in_reply_to_user_id_str      12409
in_reply_to_screen_name      12409
user                             0
geo                          12718
coordinates                  12718
place                        12717
contributors                 12718
is_quote_status                  0
retweet_count                    0
favorite_count                   0
favorited                        0
retweeted                        0
possibly_sensitive            5930
lang                             0
retweeted_status              8317
quoted_status_id             11108
quoted_status_id_str         11108
quoted_status_permalink      11108
quoted_status                11471
Trump                            0
Biden                            0
Harris                           0
Pence                            0
Republican                       0
dtype: int64

Drop Retweets

The first task was to drop unwanted observations. For this project, retweets are not of interest, for two reasons. First, the research question and hypotheses concern the negativity of Biden’s tweets, that is, the tweets he writes rather than tweets written by other people, so it makes sense to exclude retweets. Second, from a more practical standpoint, retweets cannot be favorited; only the original tweet can. All retweets therefore have a favorite count of zero, which is not an accurate representation of how much people liked or engaged with the retweet. For these reasons, all retweets were dropped from the dataset. To do so, a new variable was created to indicate whether each tweet was a retweet, and tweets flagged as retweets were dropped.
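How the retweet flag was built is not shown in the export. One plausible approach, sketched below, treats a tweet as a retweet when the API's `retweeted_status` field is present; this is consistent with the missing-value counts above (8,317 tweets lack `retweeted_status`) but remains an assumption about the original code. The toy DataFrame is hypothetical.

```python
import pandas as pd

# Toy data: the second row mimics a retweet, which carries a retweeted_status payload.
tweets = pd.DataFrame({
    "full_text": ["Original post", "RT @someone: shared post", "Another original"],
    "retweeted_status": [float("nan"), {"id": 1}, float("nan")],
    "favorite_count": [10, 0, 7],
})

# Flag retweets: 1 when retweeted_status is present, 0 otherwise.
tweets["is_retweet"] = tweets["retweeted_status"].notna().astype(int)
n_retweets = int(tweets["is_retweet"].sum())

# Keep only original tweets and renumber the rows.
originals = tweets[tweets["is_retweet"] == 0].reset_index(drop=True)
print(n_retweets, len(originals))
```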

full_text is_retweet
0 I was honored to receive the first ever Presid... 0
1 RT @marklevinshow: My interview with the presi... 1
2 RT @realDonaldTrump: Will be in Sanford, Flori... 1
3 Documents reveal that General Flynn was entrap... 0
4 .@SteveScully, the Never Trumper next debate m... 0
4401

4,401 of the tweets were retweets.

8317

The new dataset has 8,317 tweets, none of which are retweets.

full_text retweet_count favorite_count is_retweet
0 I was honored to receive the first ever Presid... 20884 85771 0
3 Documents reveal that General Flynn was entrap... 41969 140093 0
4 .@SteveScully, the Never Trumper next debate m... 33220 121620 0
5 Thank you @SenatorDole. So true! https://t.co/... 15147 58881 0
6 https://t.co/UGIAvC7VA3 19078 54239 0

The index of the dataset was then reset.

Check date of Tweets

Next, it was important to ensure that none of the tweets were from before the election cycle, so the date created variable was changed into a datetime variable.
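The `datetime64[ns, UTC]` dtype in the output suggests the conversion was done with pandas, but the cell is not shown. As an illustration of what the conversion involves, the sketch below parses the API's `created_at` strings with the standard library only; the format string is inferred from the raw values printed in this section, and `parse_created_at` is a hypothetical helper.

```python
from datetime import datetime, timezone

# Layout of the raw strings, e.g. "Sat Oct 10 03:09:32 +0000 2020".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(raw: str) -> datetime:
    """Turn a raw created_at string into a timezone-aware datetime."""
    return datetime.strptime(raw, TWITTER_TIME_FORMAT)


raw_dates = ["Sat Oct 10 03:09:32 +0000 2020", "Fri Oct 09 23:35:09 +0000 2020"]
parsed = [parse_created_at(r) for r in raw_dates]

# Once parsed, the earliest tweet is a simple minimum.
oldest = min(parsed)
print(oldest.isoformat())
```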

0    Sat Oct 10 03:09:32 +0000 2020
1    Fri Oct 09 23:35:09 +0000 2020
2    Fri Oct 09 23:31:20 +0000 2020
3    Fri Oct 09 23:01:54 +0000 2020
4    Fri Oct 09 22:30:20 +0000 2020
Name: created_at, dtype: object
0   2020-10-10 03:09:32+00:00
1   2020-10-09 23:35:09+00:00
2   2020-10-09 23:31:20+00:00
3   2020-10-09 23:01:54+00:00
4   2020-10-09 22:30:20+00:00
Name: created_at, dtype: datetime64[ns, UTC]
count                          8317
unique                         8191
top       2020-05-19 22:23:51+00:00
freq                              4
first     2019-08-05 17:58:00+00:00
last      2020-10-10 03:09:32+00:00
Name: created_at, dtype: object

The oldest tweet is from August 5th, 2019. This is after all four had begun campaigning, so no tweets need to be dropped.

index created_at id id_str full_text truncated display_text_range entities extended_entities source ... quoted_status_id quoted_status_id_str quoted_status_permalink quoted_status Trump Biden Harris Pence Republican is_retweet
7388 199 2019-08-05 17:58:00+00:00 1158437011692429314 1158437011692429314 Gun violence is an epidemic. It impacts our co... False [0, 179] {'hashtags': [], 'symbols': [], 'user_mentions... NaN <a href="https://sproutsocial.com" rel="nofoll... ... 1158211041999970304.000 1158211041999970317 {'url': 'https://t.co/GqZAZurc8D', 'expanded':... {'created_at': 'Mon Aug 05 03:00:05 +0000 2019... 0 0 1 0 0 0

1 rows × 38 columns

index created_at id id_str full_text truncated display_text_range entities extended_entities source ... quoted_status_id quoted_status_id_str quoted_status_permalink quoted_status Trump Biden Harris Pence Republican is_retweet
4628 199 2019-10-26 21:03:00+00:00 1188199370463821824 1188199370463821824 If you work hard, you should be able to share ... False [0, 276] {'hashtags': [], 'symbols': [], 'user_mentions... NaN <a href="https://about.twitter.com/products/tw... ... nan NaN NaN NaN 0 1 0 0 0 0

1 rows × 38 columns

index created_at id id_str full_text truncated display_text_range entities extended_entities source ... quoted_status_id quoted_status_id_str quoted_status_permalink quoted_status Trump Biden Harris Pence Republican is_retweet
1597 185 2020-07-17 16:25:03+00:00 1284162207232733185 1284162207232733185 THANK YOU to the 5 million members of the @NRA... False [0, 284] {'hashtags': [], 'symbols': [], 'user_mentions... NaN <a href="http://twitter.com/download/iphone" r... ... 1283748224243728384.000 1283748224243728384 {'url': 'https://t.co/8ZhChqxgBI', 'expanded':... {'created_at': 'Thu Jul 16 13:00:02 +0000 2020... 1 0 0 0 1 0

1 rows × 38 columns

Add sentiment scores of each tweet

To add the sentiment scores of the tweets, I created a for loop that added the scores to lists that were then added to the dataset.
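The scoring loop is not included in the export. A minimal sketch of the pattern described above (loop over the tweets, append each score to a list, then attach the lists as columns) could look like the following. The function `score_tweets` and the `toy_scorer` stand-in are hypothetical; the key names `pos`, `neg`, `neu`, and `compound` are those returned by VADER's `polarity_scores` method.

```python
from typing import Callable, Dict, List

Scores = Dict[str, float]


def score_tweets(
    texts: List[str], polarity_scores: Callable[[str], Scores]
) -> Dict[str, List[float]]:
    """Collect one list per sentiment score, ready to be added as columns."""
    columns: Dict[str, List[float]] = {
        "positive": [], "negative": [], "neutral": [], "compound": []
    }
    for text in texts:
        scores = polarity_scores(text)
        columns["positive"].append(scores["pos"])
        columns["negative"].append(scores["neg"])
        columns["neutral"].append(scores["neu"])
        columns["compound"].append(scores["compound"])
    return columns


# With the real library (assumes the vaderSentiment package is installed):
#   from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#   columns = score_tweets(texts, SentimentIntensityAnalyzer().polarity_scores)


def toy_scorer(text: str) -> Scores:
    """Hypothetical stand-in for VADER so the sketch runs without the package."""
    is_negative = "bad" in text.lower()
    return {
        "neg": 1.0 if is_negative else 0.0,
        "neu": 0.0 if is_negative else 1.0,
        "pos": 0.0,
        "compound": -1.0 if is_negative else 0.0,
    }


demo_columns = score_tweets(["A bad day", "Hello"], toy_scorer)
```

Each list in the returned dictionary has one entry per tweet, so it can be assigned directly as a new column of the dataset.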

full_text positive negative neutral compound
0 I was honored to receive the first ever Presid... 0.270 0.000 0.730 0.836
1 Documents reveal that General Flynn was entrap... 0.000 0.000 1.000 0.000
2 .@SteveScully, the Never Trumper next debate m... 0.000 0.173 0.827 -0.742
3 Thank you @SenatorDole. So true! https://t.co/... 0.616 0.000 0.384 0.751
4 https://t.co/UGIAvC7VA3 0.000 0.000 1.000 0.000

Media in tweet

Next, I added the control variable for whether media was included in the tweet. As some tweets have photos or videos while others do not, it is important to control for differences that might affect the overall engagement. I did this by adding a variable for whether the ‘extended_entities’ variable mentioned media or not. I used a function provided in the ‘useful functions’ file.
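The helper from the ‘useful functions’ file is not shown here. A hypothetical equivalent is sketched below: it returns 1 when `extended_entities` is a dictionary with a non-empty `media` list and 0 otherwise, which also covers the NaN values that appear for tweets without media. The name `has_media` is an assumption, not the original function.

```python
from typing import Any


def has_media(extended_entities: Any) -> int:
    """Return 1 if the tweet carries at least one media item, else 0."""
    # Tweets without media have NaN (a float) or None here, not a dict.
    if isinstance(extended_entities, dict) and extended_entities.get("media"):
        return 1
    return 0


examples = [
    {"media": [{"id": 1314700859079524352, "type": "photo"}]},  # tweet with a photo
    float("nan"),                                               # no extended_entities
    None,
    {"media": []},                                              # empty media list
]
flags = [has_media(e) for e in examples]
print(flags)
```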

0    {'media': [{'id': 1314700859079524352, 'id_str...
1                                                  NaN
2                                                  NaN
3                                                  NaN
4                                                  NaN
Name: extended_entities, dtype: object
media extended_entities
0 1 {'media': [{'id': 1314700859079524352, 'id_str...
1 0 NaN
2 0 NaN
3 0 NaN
4 0 NaN

Length of Tweet

A control variable for the length of the tweet was also created. Past research has shown that tweets of different lengths have different effects (Han, Gu, & Peng, 2019), so it is important to control for these differences.

0    191
1     72
2    196
3     56
4     23
Name: length, dtype: int64
index                                      int64
created_at                   datetime64[ns, UTC]
id                                         int64
id_str                                    object
full_text                                 object
truncated                                   bool
display_text_range                        object
entities                                  object
extended_entities                         object
source                                    object
in_reply_to_status_id                     object
in_reply_to_status_id_str                 object
in_reply_to_user_id                       object
in_reply_to_user_id_str                   object
in_reply_to_screen_name                   object
user                                      object
geo                                       object
coordinates                               object
place                                     object
contributors                              object
is_quote_status                             bool
retweet_count                              int64
favorite_count                             int64
favorited                                   bool
retweeted                                   bool
possibly_sensitive                        object
lang                                      object
retweeted_status                          object
quoted_status_id                         float64
quoted_status_id_str                      object
quoted_status_permalink                   object
quoted_status                             object
Trump                                      int64
Biden                                      int64
Harris                                     int64
Pence                                      int64
Republican                                 int64
is_retweet                                 int64
positive                                 float64
negative                                 float64
neutral                                  float64
compound                                 float64
media                                      int64
length                                     int64
dtype: object
negative          0
length            0
media             0
retweet_count     0
favorite_count    0
Trump             0
Republican        0
dtype: int64

No missing values in any of the variables of interest.
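A check like the one above can be sketched with pandas. This is a minimal illustration rather than the original notebook code: the column names come from the output above, while the `toy` frame and its values are made up for the example.

```python
import pandas as pd

# Columns of interest, as listed in the missing-value output above.
cols = ["negative", "length", "media", "retweet_count",
        "favorite_count", "Trump", "Republican"]

# Hypothetical toy rows standing in for the real tweet data.
toy = pd.DataFrame({
    "negative": [0.0, 0.173, 0.05],
    "length": [120, 196, 260],
    "media": [0, 0, 1],
    "retweet_count": [1044, 3803, 11897],
    "favorite_count": [4746, 17474, 54063],
    "Trump": [0, 1, 0],
    "Republican": [0, 1, 0],
})

# Count missing values per column; all zeros means nothing to impute or drop.
missing = toy[cols].isna().sum()
print(missing)
```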

# Data Exploration and Evaluation

To begin the data exploration and evaluation process, descriptive tables were made. A summary of the descriptive statistic findings can be found at the end of this section.

                   count       mean        std    min       25%        50%        75%          max
retweet_count   8317.000   9504.060  15685.700  0.000  1044.000   3803.000  11897.000   415300.000
favorite_count  8317.000  45996.824  84388.376  0.000  4746.000  17474.000  54063.000  1897125.000
negative        8317.000      0.075      0.100  0.000     0.000      0.037      0.127        0.831
length          8317.000    184.580     83.943  7.000   118.000    199.000    260.000      320.000
media           8317.000      0.340      0.474  0.000     0.000      0.000      1.000        1.000

user                          Biden       Harris        Pence        Trump
retweet_count   count      3031.000     2760.000      928.000     1598.000
                mean       9488.394     4064.164     1925.446    23330.440
                std       16340.594     7964.101     2845.855    19608.835
                min          11.000        2.000       68.000        0.000
                25%        1626.000      696.000      514.750    11194.250
                50%        4719.000     1648.500      940.000    18227.500
                75%       11310.000     4281.500     1971.500    29937.750
                max      327694.000   184872.000    26943.000   415300.000
favorite_count  count      3031.000     2760.000      928.000     1598.000
                mean      50704.301    21228.082     9487.755   101049.254
                std       99206.297    44229.749    13808.130   100050.109
                min          34.000       12.000      259.000        0.000
                25%        7168.500     2955.000     2698.000    44051.250
                50%       20899.000     7642.000     4863.500    73987.000
                75%       53120.500    21452.000     9785.250   125285.250
                max     1897125.000  1001691.000   167461.000  1885859.000
negative        count      3031.000     2760.000      928.000     1598.000
                mean          0.079        0.090        0.027        0.072
                std           0.096        0.100        0.054        0.118
                min           0.000        0.000        0.000        0.000
                25%           0.000        0.000        0.000        0.000
                50%           0.050        0.066        0.000        0.000
                75%           0.133        0.147        0.038        0.117
                max           0.658        0.612        0.363        0.831

Trump                             0            1
negative        count      6719.000     1598.000
                mean          0.076        0.072
                std           0.095        0.118
                min           0.000        0.000
                25%           0.000        0.000
                50%           0.046        0.000
                75%           0.128        0.117
                max           0.658        0.831

Republican                        0            1
retweet_count   count      5791.000     2526.000
                mean       6903.197    15466.689
                std       13315.315    18780.076
                min           2.000        0.000
                25%         940.500     1476.500
                50%        2830.000    10606.500
                75%        7677.000    22462.000
                max      327694.000   415300.000
favorite_count  count      5791.000     2526.000
                mean      36655.887    67411.459
                std       79368.617    91379.912
                min          12.000        0.000
                25%        4134.000     7624.500
                50%       12720.000    42340.000
                75%       36720.000    94136.750
                max     1897125.000  1885859.000
negative        count      5791.000     2526.000
                mean          0.084        0.055
                std           0.098        0.102
                min           0.000        0.000
                25%           0.000        0.000
                50%           0.057        0.000
                75%           0.140        0.081
                max           0.658        0.831
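Grouped descriptive tables of this shape (statistics as rows, one column per group) can be produced with a pandas `groupby` followed by `describe` and a transpose. The sketch below is an assumption about how such a table could be built, not the original code; the `toy` frame and its four rows are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: two tweets each for two users.
toy = pd.DataFrame({
    "user": ["Biden", "Biden", "Trump", "Trump"],
    "retweet_count": [100, 300, 1000, 3000],
    "negative": [0.0, 0.1, 0.0, 0.2],
})

# describe() per group yields (variable, statistic) columns;
# transposing puts users in columns, matching the layout above.
by_user = toy.groupby("user")[["retweet_count", "negative"]].describe().T
print(by_user)
```

The same call with `"Trump"` or `"Republican"` as the grouping column would give the two-column tables.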

Distribution plot of negative sentiment scores

[Figure: density plot; x-axis: negative, y-axis: Density]

This plot isn’t the most informative, as many of the values are 0. To have a more informative graph, the distribution plot was zoomed in.

[Figure: zoomed density plot of negative; axis limits (0.0, 10.0)]

Distribution plot of favorite count

[Figure: density plot; x-axis: favorite_count, y-axis: Density]

This plot isn’t the most informative either, as most values sit close to 0 relative to the largest outliers. To have a more informative graph, the distribution plot was zoomed in.

[Figure: zoomed density plot of favorite_count; axis limits (0.0, 200000.0)]

Distribution plot of the logarithmic transformation of favorite count

[Figure: density plot; x-axis: favorite_count (log-transformed), y-axis: Density]

Distribution plot of retweet count

[Figure: density plot of retweet_count; axis limits (0.0, 80000.0)]

Distribution plot of the logarithmic transformation of retweet count

[Figure: density plot; x-axis: retweet_count (log-transformed), y-axis: Density]

Distribution plot of the length of the tweet

[Figure: density plot; x-axis: length, y-axis: Density]

[Figure: countplot; x-axis: media, y-axis: count]

Above is a countplot for whether media was part of the tweet or not.

[Figure: barplot; x-axis: user, y-axis: favorite_count]

Above is a barplot for average number of favorites per tweet by user.

[Figure: barplot; x-axis: user, y-axis: retweet_count]

Above is a barplot for average number of retweets per tweet by user.

[Figure: barplot; x-axis: user, y-axis: negative]

Above is a barplot for average negative sentiment per tweet by user.

[Figure: barplot; x-axis: Trump, y-axis: negative]

Above is a barplot of the average negative sentiment per tweet between Trump tweets and non-Trump tweets.

[Figure: barplot; x-axis: Republican, y-axis: negative]

Above is a barplot of the average negative sentiment per tweet between Republican tweets and non-Republican tweets.

[Figure: regression plot; x-axis: negative, y-axis: retweet_count]

Above is a regression plot of negative sentiment against retweet count.

[Figure: regression plot; x-axis: negative, y-axis: retweet_count (log-transformed)]

Above is a regression plot of negative sentiment against the logarithmic transformation of retweet count.

[Figure: regression plot; x-axis: negative, y-axis: favorite_count]

Above is a regression plot of negative sentiment against favorite count.

[Figure: regression plot; x-axis: negative, y-axis: favorite_count (log-transformed)]

Above is a regression plot of negative sentiment against the logarithmic transformation of favorite count.

Summary for stakeholders

The present research project uses two different dependent variables or outcomes for the concept of engagement. The first variable is the number of retweets each tweet has received. A retweet is when the tweet is reposted by another individual. For these four users, the average number of retweets was 9,504.06 (SD = 15,685.70). Trump had by far the highest average number of retweets (M = 23,330.44; SD = 19,608.84), followed by Biden (M = 9,488.39; SD = 16,340.59), Harris (M = 4,064.16; SD = 7,964.10), and Pence (M = 1,925.45; SD = 2,845.86), respectively. The second variable is the number of favorites each tweet has received. A favorite is when the tweet is liked or ‘favorited’ by another individual. For these four users, the average number of favorites was 45,996.82 (SD = 84,388.38). Trump again had the highest average number of favorites (M = 101,049.25; SD = 100,050.11), followed by Biden (M = 50,704.30; SD = 99,206.30), Harris (M = 21,228.08; SD = 44,229.75), and Pence (M = 9,487.76; SD = 13,808.13), respectively.

As for the sentiment of the tweets, the average tweet was not very negative, with an average negative polarity of 0.08 (SD = 0.10), with 0 being neutral and 1 being completely negative. Harris was the most negative (M = 0.09; SD = 0.10), followed closely by Biden (M = 0.08; SD = 0.10) and Trump (M = 0.07; SD = 0.12), with Pence being the least negative (M = 0.03; SD = 0.05).

Turning from individual users to differences by party among the Presidential and Vice Presidential candidates, Republicans on average had a higher number of retweets per tweet (M = 15,466.69; SD = 18,780.08) compared to Democrats (M = 6,903.20; SD = 13,315.32) and a higher number of favorites per tweet (M = 67,411.46; SD = 91,379.91) compared to Democrats (M = 36,655.89; SD = 79,368.62). Clearly, this is driven mostly by Trump’s popularity. In terms of negativity, Democrats had a higher average negative polarity score (M = 0.08, SD = 0.10) than Republicans (M = 0.06, SD = 0.10).

The average length of the tweets was 184.58 characters (SD = 83.94), and about a third (34%) of the tweets included some form of media such as a video or photograph.

The distributions for retweet count, favorite count, and negative sentiment are positively skewed, owing to the high number of values around zero and the large number of positive outliers. As these variables are not normally distributed, the residuals of a regression might also not be normally distributed, which would violate the regression assumption of normality. This can be checked with a plot of errors, and if they are not normally distributed, this could be addressed using a log transformation, as shown in the distribution plots. However, for the sake of model interpretability and machine learning predictions, this project will use the original data without transformations (except for the above regression plots). This is a possible drawback, which is discussed in the limitations section below.

Because of this skew, the regression plots with negativity as the IV and retweet count or favorite count as the DV are not very informative. However, when the log is taken of the DVs, there seems to be a slight positive relationship between negative sentiment and engagement, as indicated by the slope of the regression line.
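To make the effect of the log transformation concrete, here is a small sketch in plain Python. It is illustrative only: the five input values are the minimum, quartiles, and maximum of retweet_count from the descriptive table, used as a stand-in for the full data, and `log1p` (the log of 1 + x) is an assumption, chosen because the minimum count is 0 and a plain log of 0 is undefined. The source does not state which variant was used for the plots.

```python
import math

def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Min, 25%, 50%, 75%, and max of retweet_count (a toy stand-in for the data).
toy_counts = [0, 1044, 3803, 11897, 415300]

# log1p keeps zero counts finite: log1p(0) == 0.
logged = [math.log1p(c) for c in toy_counts]

raw_skew = skewness(toy_counts)
log_skew = skewness(logged)
print(raw_skew, log_skew)
```

On these values the raw counts are strongly right-skewed, while the logged values are much closer to symmetric.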

Models

Model 1: retweet count without controls

                            OLS Regression Results                            
==============================================================================
Dep. Variable:          retweet_count   R-squared:                       0.218
Model:                            OLS   Adj. R-squared:                  0.217
Method:                 Least Squares   F-statistic:                     578.0
Date:                Sun, 18 Oct 2020   Prob (F-statistic):               0.00
Time:                        18:28:33   Log-Likelihood:                -91127.
No. Observations:                8317   AIC:                         1.823e+05
Df Residuals:                    8312   BIC:                         1.823e+05
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   8879.9043    280.050     31.708      0.000    8330.936    9428.872
negative    7716.0918   1547.219      4.987      0.000    4683.157    1.07e+04
Trump        1.39e+04    429.173     32.384      0.000    1.31e+04    1.47e+04
Pence      -7164.3142    526.749    -13.601      0.000   -8196.874   -6131.755
Harris     -5507.5940    365.513    -15.068      0.000   -6224.091   -4791.097
==============================================================================
Omnibus:                    12425.873   Durbin-Watson:                   1.481
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          9983022.957
Skew:                           8.916   Prob(JB):                         0.00
Kurtosis:                     171.789   Cond. No.                         11.1
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Fit

The first model investigates the main effect of negative sentiment on retweet count. Binary variables are added for Trump, Pence, and Harris, so the reference category is Biden tweets. The R-squared is 0.22, indicating 22% of the variance of retweet count is explained by the model.

Model 2: retweet count with controls

                            OLS Regression Results                            
==============================================================================
Dep. Variable:          retweet_count   R-squared:                       0.235
Model:                            OLS   Adj. R-squared:                  0.234
Method:                 Least Squares   F-statistic:                     425.3
Date:                Sun, 18 Oct 2020   Prob (F-statistic):               0.00
Time:                        18:28:33   Log-Likelihood:                -91034.
No. Observations:                8317   AIC:                         1.821e+05
Df Residuals:                    8310   BIC:                         1.821e+05
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   1.409e+04    495.004     28.463      0.000    1.31e+04    1.51e+04
negative    7869.0982   1557.796      5.051      0.000    4815.429    1.09e+04
Trump       1.233e+04    445.700     27.675      0.000    1.15e+04    1.32e+04
Harris     -5772.8657    362.105    -15.943      0.000   -6482.682   -5063.049
Pence      -6395.8029    525.282    -12.176      0.000   -7425.486   -5366.119
length       -20.1916      1.941    -10.403      0.000     -23.996     -16.387
media      -3507.9851    331.520    -10.582      0.000   -4157.847   -2858.124
==============================================================================
Omnibus:                    12443.464   Durbin-Watson:                   1.486
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         10317471.100
Skew:                           8.928   Prob(JB):                         0.00
Kurtosis:                     174.621   Cond. No.                     2.11e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.11e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Fit

The second model mirrors the first but includes the control variables of length and media. The R-squared improved from 0.22 to 0.24 (adjusted R-squared from 0.217 to 0.234), so model 2 is preferred over model 1.

The positive effect of negative sentiment on retweet count can be visualized above.
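Because R-squared cannot decrease when predictors are added, the Model 1 versus Model 2 comparison is fairer on adjusted R-squared, which penalizes extra terms. The sketch below applies the standard formula to the rounded R-squared values, observation count, and model degrees of freedom printed in the two tables; the function name is my own, and small differences from the printed values are expected because the inputs are rounded.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Inputs read from the OLS tables: R-squared, No. Observations, Df Model.
model_1 = adjusted_r2(0.218, 8317, 4)
model_2 = adjusted_r2(0.235, 8317, 6)
print(round(model_1, 3), round(model_2, 3))
```

With more than 8,000 observations the penalty is tiny, so the two measures lead to the same preference here.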

Model 3: favorite count without controls

                            OLS Regression Results                            
==============================================================================
Dep. Variable:         favorite_count   R-squared:                       0.132
Model:                            OLS   Adj. R-squared:                  0.132
Method:                 Least Squares   F-statistic:                     317.4
Date:                Sun, 18 Oct 2020   Prob (F-statistic):          1.61e-254
Time:                        18:28:34   Log-Likelihood:            -1.0555e+05
No. Observations:                8317   AIC:                         2.111e+05
Df Residuals:                    8312   BIC:                         2.111e+05
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   5.003e+04   1586.501     31.535      0.000    4.69e+04    5.31e+04
negative    8544.2518   8765.092      0.975      0.330   -8637.514    2.57e+04
Trump       5.041e+04   2431.291     20.733      0.000    4.56e+04    5.52e+04
Harris     -2.957e+04   2070.655    -14.280      0.000   -3.36e+04   -2.55e+04
Pence      -4.078e+04   2984.066    -13.664      0.000   -4.66e+04   -3.49e+04
==============================================================================
Omnibus:                    11758.802   Durbin-Watson:                   1.428
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          5440616.514
Skew:                           8.187   Prob(JB):                         0.00
Kurtosis:                     127.224   Cond. No.                         11.1
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Fit

The third model investigates the main effect of negative sentiment on favorite count. Binary variables are added for Trump, Pence, and Harris, so again the reference category is Biden tweets. The R-squared is 0.13, indicating 13% of the variance of favorite count is explained by the model.

Model 4: favorite count with controls

                            OLS Regression Results                            
==============================================================================
Dep. Variable:         favorite_count   R-squared:                       0.170
Model:                            OLS   Adj. R-squared:                  0.170
Method:                 Least Squares   F-statistic:                     284.7
Date:                Sun, 18 Oct 2020   Prob (F-statistic):               0.00
Time:                        18:28:34   Log-Likelihood:            -1.0536e+05
No. Observations:                8317   AIC:                         2.107e+05
Df Residuals:                    8310   BIC:                         2.108e+05
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   9.223e+04   2773.014     33.259      0.000    8.68e+04    9.77e+04
negative    1.115e+04   8726.784      1.278      0.201   -5954.323    2.83e+04
Trump       3.761e+04   2496.816     15.063      0.000    3.27e+04    4.25e+04
Harris     -3.164e+04   2028.517    -15.597      0.000   -3.56e+04   -2.77e+04
Pence      -3.484e+04   2942.633    -11.839      0.000   -4.06e+04   -2.91e+04
length      -166.7139     10.873    -15.333      0.000    -188.027    -145.401
media      -2.691e+04   1857.176    -14.490      0.000   -3.06e+04   -2.33e+04
==============================================================================
Omnibus:                    11746.789   Durbin-Watson:                   1.423
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          5685698.422
Skew:                           8.148   Prob(JB):                         0.00
Kurtosis:                     130.049   Cond. No.                     2.11e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.11e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Fit

The fourth model mirrors the third but includes the control variables of length and media. The R-squared improved from 0.13 to 0.17, so model 4 is preferred over model 3.

The positive, though not statistically significant (p = 0.201), association between negative sentiment and favorite count can be visualized above.

Model 5: Interaction between negative sentiment and Trump tweets on retweet count

                            OLS Regression Results                            
==============================================================================
Dep. Variable:          retweet_count   R-squared:                       0.237
Model:                            OLS   Adj. R-squared:                  0.236
Method:                 Least Squares   F-statistic:                     368.6
Date:                Sun, 18 Oct 2020   Prob (F-statistic):               0.00
Time:                        18:28:35   Log-Likelihood:                -91023.
No. Observations:                8317   AIC:                         1.821e+05
Df Residuals:                    8309   BIC:                         1.821e+05
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept       1.451e+04    502.684     28.869      0.000    1.35e+04    1.55e+04
negative        3575.2435   1809.547      1.976      0.048      28.079    7122.408
Trump           1.113e+04    515.496     21.585      0.000    1.01e+04    1.21e+04
negative:Trump  1.593e+04   3428.586      4.647      0.000    9212.256    2.27e+04
Harris         -5720.2414    361.835    -15.809      0.000   -6429.528   -5010.955
Pence          -6644.8885    527.363    -12.600      0.000   -7678.652   -5611.125
length           -20.8055      1.943    -10.708      0.000     -24.614     -16.997
media          -3397.9422    331.955    -10.236      0.000   -4048.658   -2747.227
==============================================================================
Omnibus:                    12482.896   Durbin-Watson:                   1.488
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         10530335.745
Skew:                           8.981   Prob(JB):                         0.00
Kurtosis:                     176.391   Cond. No.                     4.84e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.84e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Fit

The fifth model investigates the interaction effect of negative sentiment and Trump tweets on retweet count. Binary variables are added for Trump, Pence, and Harris, so again the reference category is Biden tweets. The R-squared is 0.24, indicating 24% of the variance of retweet count is explained by the model.

The greater positive effect of negative sentiment on retweet count for Trump can be visualized above.
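The structure of an interaction model like Model 5 can be sketched with NumPy least squares. This is not the call that produced the table above (that output appears to come from statsmodels); it is a minimal illustration on synthetic, noiseless data with made-up coefficients, showing that the slope of negativity for the flagged group is the main effect plus the interaction coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors: a negativity score in [0, 1] and a 0/1 group flag.
negative = rng.uniform(0, 1, n)
trump = rng.integers(0, 2, n)

# Made-up "true" coefficients: intercept, negative, Trump, negative:Trump.
true_beta = np.array([100.0, 10.0, 50.0, 40.0])

# Design matrix with an explicit interaction column (negative * Trump).
X = np.column_stack([np.ones(n), negative, trump, negative * trump])
y = X @ true_beta  # noiseless, so least squares recovers true_beta

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

slope_other = beta_hat[1]                # effect of negativity when Trump == 0
slope_trump = beta_hat[1] + beta_hat[3]  # effect of negativity when Trump == 1
print(slope_other, slope_trump)
```

Applied to the Model 5 table, the same reading gives a slope of roughly 3,575 retweets for the reference group and roughly 3,575 + 15,930 for Trump.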

Model 6: Interaction between negative sentiment and Trump tweets on favorite count

                            OLS Regression Results                            
==============================================================================
Dep. Variable:         favorite_count   R-squared:                       0.171
Model:                            OLS   Adj. R-squared:                  0.171
Method:                 Least Squares   F-statistic:                     245.6
Date:                Sun, 18 Oct 2020   Prob (F-statistic):               0.00
Time:                        18:28:36   Log-Likelihood:            -1.0536e+05
No. Observations:                8317   AIC:                         2.107e+05
Df Residuals:                    8309   BIC:                         2.108e+05
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept       9.381e+04   2818.073     33.287      0.000    8.83e+04    9.93e+04
negative       -4885.0701   1.01e+04     -0.482      0.630   -2.48e+04     1.5e+04
Trump            3.31e+04   2889.895     11.453      0.000    2.74e+04    3.88e+04
negative:Trump  5.951e+04   1.92e+04      3.096      0.002    2.18e+04    9.72e+04
Harris         -3.144e+04   2028.463    -15.500      0.000   -3.54e+04   -2.75e+04
Pence          -3.577e+04   2956.423    -12.098      0.000   -4.16e+04      -3e+04
length          -169.0068     10.892    -15.516      0.000    -190.358    -147.655
media           -2.65e+04   1860.958    -14.240      0.000   -3.01e+04   -2.29e+04
==============================================================================
Omnibus:                    11768.342   Durbin-Watson:                   1.424
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          5756018.104
Skew:                           8.174   Prob(JB):                         0.00
Kurtosis:                     130.838   Cond. No.                     4.84e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.84e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Fit

The sixth model investigates the interaction effect of negative sentiment and Trump tweets on favorite count. Binary variables are added for Trump, Pence, and Harris, so again the reference category is Biden tweets. The R-squared is 0.17, indicating 17% of the variance of favorite count is explained by the model.

The greater positive effect of negative sentiment on favorite count for Trump can be visualized above.

Model 7: Interaction between negative sentiment and Republican tweets on retweet count

                            OLS Regression Results                            
==============================================================================
Dep. Variable:          retweet_count   R-squared:                       0.129
Model:                            OLS   Adj. R-squared:                  0.129
Method:                 Least Squares   F-statistic:                     246.6
Date:                Sun, 18 Oct 2020   Prob (F-statistic):          2.20e-246
Time:                        18:28:37   Log-Likelihood:                -91572.
No. Observations:                8317   AIC:                         1.832e+05
Df Residuals:                    8311   BIC:                         1.832e+05
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept            1.533e+04    488.693     31.365      0.000    1.44e+04    1.63e+04
negative             2267.6330   1974.335      1.149      0.251   -1602.557    6137.823
Republican           5560.0105    428.633     12.971      0.000    4719.783    6400.238
negative:Republican  3.444e+04   3487.129      9.878      0.000    2.76e+04    4.13e+04
length                -35.9548      2.003    -17.955      0.000     -39.880     -32.029
media               -4785.3198    348.785    -13.720      0.000   -5469.026   -4101.613
==============================================================================
Omnibus:                    11548.701   Durbin-Watson:                   1.330
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          6772282.307
Skew:                           7.780   Prob(JB):                         0.00
Kurtosis:                     141.926   Cond. No.                     4.68e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.68e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Fit

The seventh model investigates the interaction effect of negative sentiment and Republican tweets on retweet count. The R-squared is 0.13, indicating 13% of the variance of retweet count is explained by the model.

The greater positive effect of negative sentiment on retweet count for Republicans can be visualized above.

Model 8: Interaction between negative sentiment and Republican tweets on favorite count

                            OLS Regression Results                            
==============================================================================
Dep. Variable:         favorite_count   R-squared:                       0.104
Model:                            OLS   Adj. R-squared:                  0.103
Method:                 Least Squares   F-statistic:                     192.6
Date:                Sun, 18 Oct 2020   Prob (F-statistic):          9.64e-195
Time:                        18:28:38   Log-Likelihood:            -1.0569e+05
No. Observations:                8317   AIC:                         2.114e+05
Df Residuals:                    8311   BIC:                         2.114e+05
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept            9.257e+04   2667.140     34.707      0.000    8.73e+04    9.78e+04
negative            -1.427e+04   1.08e+04     -1.324      0.185   -3.54e+04    6853.173
Republican           1.534e+04   2339.349      6.555      0.000    1.07e+04    1.99e+04
negative:Republican  1.403e+05    1.9e+04      7.374      0.000    1.03e+05    1.78e+05
length               -226.8463     10.929    -20.756      0.000    -248.270    -205.422
media               -3.134e+04   1903.565    -16.462      0.000   -3.51e+04   -2.76e+04
==============================================================================
Omnibus:                    11283.257   Durbin-Watson:                   1.329
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          4524473.945
Skew:                           7.587   Prob(JB):                         0.00
Kurtosis:                     116.251   Cond. No.                     4.68e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.68e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Fit

The eighth model investigates the interaction effect of negative sentiment and Republican tweets on favorite count. The R-squared is 0.10, indicating 10% of the variance of favorite count is explained by the model.

The greater positive effect of negative sentiment on favorite count for Republicans can be visualized above.

Machine Learning Models for predictive analytics

Two new variables are created for the interaction terms: one is negative by Trump and the other is negative by Republican.

neg_trump neg_rep
0 0.000 0.000
1 0.000 0.000
2 0.173 0.173
3 0.000 0.000
4 0.000 0.000
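Interaction columns of this kind are products of the sentiment score and the group indicator. The following pandas sketch shows one way to build them; the column names `neg_trump` and `neg_rep` follow the output above, while the `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rows: a Democrat, two Trump tweets, and a non-Trump Republican.
toy = pd.DataFrame({
    "negative": [0.0, 0.0, 0.173, 0.25],
    "Trump": [0, 1, 1, 0],
    "Republican": [0, 1, 1, 1],
})

# The product is zero unless the tweet belongs to the flagged group.
toy["neg_trump"] = toy["negative"] * toy["Trump"]
toy["neg_rep"] = toy["negative"] * toy["Republican"]
print(toy[["neg_trump", "neg_rep"]])
```

A tweet from a Republican other than Trump (the last row) contributes to `neg_rep` but not to `neg_trump`.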

Predictive Model for Model 5

Because the interaction terms were statistically significant in all of the interaction models, and because the R-squared increased slightly when the Trump interaction term was added (Models 5 and 6 versus Models 2 and 4), the predictive models were built to mirror the interaction models.

LinearRegression()

Above is the code to create a machine learning model for predictive analytics for Model 5.

array([13471.48241824])

A Biden tweet that is not negative, that is 50 characters long, and does not have media is projected to have 13,471 retweets.

array([17046.72594331])

A Biden tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 17,047 retweets. That is an increase of about 3,575 retweets.

array([24598.25359172])

A Trump tweet that is not negative, that is 50 characters long, and does not have media is projected to have 24,598 retweets.

array([44106.63682023])

A Trump tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 44,107 retweets. That is an increase of 19,508 retweets, a much larger increase than for Biden.
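These four projections can be reproduced approximately by hand from the Model 5 coefficient table. The sketch below plugs in the coefficients as printed there; three of them are shown in the table only to four significant figures, so the results differ from the fitted model's output by a retweet or two. The function name and dictionary are my own.

```python
# Coefficients copied from the Model 5 OLS table (rounded as printed there).
coef = {
    "Intercept": 14510.0,       # printed as 1.451e+04
    "negative": 3575.2435,
    "Trump": 11130.0,           # printed as 1.113e+04
    "negative:Trump": 15930.0,  # printed as 1.593e+04
    "length": -20.8055,
    "media": -3397.9422,
}

def predict_retweets(negative, trump, length, media):
    """Linear prediction for a Biden (trump=0) or Trump (trump=1) tweet.

    The Harris and Pence dummies are zero in both cases and are omitted.
    """
    return (coef["Intercept"]
            + coef["negative"] * negative
            + coef["Trump"] * trump
            + coef["negative:Trump"] * negative * trump
            + coef["length"] * length
            + coef["media"] * media)

# Gain from moving negativity from 0 to 1, holding length and media fixed.
biden_gain = predict_retweets(1, 0, 50, 0) - predict_retweets(0, 0, 50, 0)
trump_gain = predict_retweets(1, 1, 50, 0) - predict_retweets(0, 1, 50, 0)
print(biden_gain, trump_gain)
```

The Biden gain equals the `negative` coefficient alone, while the Trump gain adds the `negative:Trump` interaction on top of it, which is why the projected increase is so much larger for Trump.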

Predictive Model for Model 6

LinearRegression()
array([85354.96269929])

A Biden tweet that is not negative, that is 50 characters long, and does not have media is projected to have 85,355 favorites.

array([80469.89261586])

A Biden tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 80,470 favorites. That is a decrease of 4,885 favorites.

array([118453.34119362])

A Trump tweet that is not negative, that is 50 characters long, and does not have media is projected to have 118,453 favorites.

array([173078.08302607])

A Trump tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 173,078 favorites. That is an increase of 54,625 favorites.

Predictive Model for Model 7

LinearRegression()
array([13530.29502519])

A Biden or Harris tweet that is not negative, that is 50 characters long, and does not have media is projected to have 13,530 retweets.

array([15797.92798289])

A Biden or Harris tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 15,798 retweets.

array([19090.30549902])

A Trump or Pence tweet that is not negative, that is 50 characters long, and does not have media is projected to have 19,090 retweets.

array([55802.67925931])

A Trump or Pence tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 55,803 retweets.

Predictive Model for Model 8

LinearRegression()
array([81227.274972])

A Biden or Harris tweet that is not negative, that is 50 characters long, and does not have media is projected to have 81,227 favorites.

array([66958.12512068])

A Biden or Harris tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 66,958 favorites.

array([96562.36954165])

A Trump or Pence tweet that is not negative, that is 50 characters long, and does not have media is projected to have 96,562 favorites.

array([222631.76330142])

A Trump or Pence tweet that is completely negative, that is 50 characters long, and does not have media is projected to have 222,632 favorites.

Lime for Model 5

[1.73e-01 1.00e+00 1.73e-01 0.00e+00 0.00e+00 1.96e+02 0.00e+00]
Intercept -5121.505631164369
Prediction_local [26287.6598739]
Right: 24935.6080733102