COVID Tweets — Finding Similar Twitter Users in the First Days of the Pandemic

Doug Rizio
Published in Analytics Vidhya · 17 min read · Apr 21, 2021


Introduction

Even in the earliest days of the Coronavirus pandemic, many people took to social media platforms like Twitter to share their thoughts about the deadly disease. Questions about COVID’s spread in China, concerns about its rate of mortality, debates about government responses, and conspiracy theories about its origins were just a few of the many related topics talked about online. Just like with any popular subject, users often found themselves at odds with each other, arguing in favor of their own opinions against the opposing side, and leading public discourse throughout the country.

Given the increasing political polarization in the US, many opinions about the Coronavirus are also tied to political ideologies — Democrats are more likely to promote masks than their Republican counterparts [1], conservatives are more likely than liberals to blame China as the source of the virus [2], and left-wingers in America tend to be much more in favor of government pandemic mandates than right-wingers are. [3]

So, if a single opinion can be tied to an entire political party, what then does that say about the person who has that opinion? Can we ascertain someone’s political stance based on a few words they say online? If so, can we find similar users who might also share those same opinions? And, if we can, how might that information be used? A political recommender system that can find similar users on social media could be valuable to a wide variety of individuals and organizations — from journalists, activists and campaign workers to government departments, intelligence agencies and big businesses. While the ethics and ultimate uses of targeted political advertising by some of these institutions are questionable, it’s interesting to know how it might work in practice.

Collaborative Filtering VS Content Filtering

Collaborative VS Content-Based Filtering

In data science, there are two main ways of filtering data for recommendation systems — content-based filtering and collaborative filtering. In content-based filtering, multiple items (such as movies, TV shows, books, or any other type of content) are compared to each other and ranked by similarity based on the qualities they share. In the case of popular media, these qualities can be anything from the title, plot summary, genre, author, or year of creation. If two items share many qualities, they are ranked as more similar than items that do not. Collaborative filtering, on the other hand, measures the similarity of multiple users based on the items that they have in common. Here, we do not need to know the particular qualities of the items, only that two users react to enough of the same items in the same way — such as consuming the same content or giving the same content the same ratings. [4]

Because this project will be finding similar users on Twitter, it will rely on collaborative filtering — measuring the similarity of users based on the items that they share, with the hashtags, tweet text, user descriptions and user locations acting as items. In a way, this is almost a cross between content-based and collaborative filtering, because half of the items in the dataset are actually the qualities of the users themselves (user descriptions and user locations).

Downloading the Data

To get large amounts of old Twitter data from the earliest days of the pandemic (let’s pick February 1st, when the virus was just starting to spread out of Wuhan), we need a Twitter developer account. However, because Twitter imposes severe rate limits on the number of old tweets that a user can request each month, we can’t rely on an API wrapper like Tweepy — we just won’t get enough data that way. So, instead of collecting the data ourselves, we can use pre-existing datasets saved by other users. One of the best sources for COVID-related tweets in early 2020 is GeoCOV19 by Crisis NLP. [5] The website hosts an archive of “dehydrated” tweets in the form of simple tweet IDs, which can be fed into a “hydrator” program that “rehydrates” each ID back into the full tweet as JSON. [6] Once we download the zip file, extract the TSV file, and resave it as a CSV with some minor modifications, we are ready to hydrate.

Converting the TSV to a CSV for the Hydrator
The Hydrator Program Interface
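The conversion step can be sketched in a few lines of Python; the file names and the assumption that the tweet ID sits in the first column are mine, not taken from the archive’s documentation:

```python
import csv

def tsv_to_csv(tsv_path, csv_path, id_column=0):
    """Rewrite a tab-separated file of tweet records as a one-column
    CSV of tweet IDs, the format the Hydrator expects."""
    with open(tsv_path, newline="") as tsv_in, \
         open(csv_path, "w", newline="") as csv_out:
        reader = csv.reader(tsv_in, delimiter="\t")
        writer = csv.writer(csv_out)
        for row in reader:
            if row:  # skip blank lines
                writer.writerow([row[id_column]])
```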

After we download the “hydrator” we can input our CSV and let it run. While February 1st contains over 600,000 tweet IDs, many tweets have been deleted since then, and we end up with around 400,000 tweets instead. Because the rate limits still apply, hydrating one day of tweets takes two straight hours, with the resulting JSON file taking up several GB. However, we can convert this file back into a smaller CSV and then open it in our Notebook as two separate Pandas dataframes — one that we want to modify and one that we want to save for reference.

Reading the CSV as a Pandas Dataframe
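One way to sketch the load step, assuming the Hydrator wrote line-delimited JSON (the file names and helper are hypothetical):

```python
import pandas as pd

def load_tweets(jsonl_path, csv_path):
    """Convert the Hydrator's line-delimited JSON to a smaller CSV,
    then load it twice: one frame to clean and modify, and one
    untouched copy kept for reference."""
    pd.read_json(jsonl_path, lines=True).to_csv(csv_path, index=False)
    df = pd.read_csv(csv_path)
    return df, df.copy()
```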

Cleaning the Data

Information about the Dataframe

There are 34 columns, but many of them are unnecessary, and because our dataset is already so large it helps to remove as much extra information as we can. The most important fields here are the hashtags, the text of the tweet, the screen name of the user, the user description, and the user location. While coordinates, place, and time_zone all seem like important items, very few users include these parameters in their tweets. The rest of the fields are either not useful or interesting, or consist primarily of null values, so they are also removable.

Dropping All Unnecessary Columns
Information about the Dataframe — After Drop
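The drop itself is one line; the exact column names below are my assumptions about the hydrated schema:

```python
import pandas as pd

# Column names are assumptions about the hydrated Twitter schema.
KEEP = ["hashtags", "text", "screen_name", "user_description", "user_location"]

def drop_extra_columns(df):
    """Keep only the five fields the analysis uses; discard the rest."""
    return df[KEEP].copy()
```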

Now we have isolated our 5 most important columns, but we can still see some undesirable features in the dataset — it’s full of null values, indicated by “NaN” (not a number), and contains many locations outside of the USA. Let’s start cleaning it up.

(Note: all user screen names will be blurred out to protect their identities)

The Head of the New Dataframe

One thing we can do is make everything lowercase for easier identification.

Making Every Column Lowercase
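A minimal sketch of the lowercasing step, applied only to string columns so numeric fields and NaNs pass through untouched:

```python
def lowercase_all(df):
    """Lowercase every object (string) column; NaNs pass through untouched."""
    return df.apply(lambda col: col.str.lower() if col.dtype == object else col)
```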

We can also replace the names of important cities like “Washington, DC” (the US capital) and “New York City” (the largest city in the US) with a single term, and give the other two-term US states a single word for easier searches.

Converting Two-Term States to Single Word
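The replacements might look like this; the mapping below is abbreviated and the single-token spellings are my own choices, not taken from the article’s screenshots:

```python
# Abbreviated mapping; the full version covers every two-word state name.
# "new york city" must come before "new york" so the longer match wins.
REPLACEMENTS = {
    "washington, dc": "dc",
    "new york city": "nyc",
    "new york": "newyork",
    "new jersey": "newjersey",
    "north carolina": "northcarolina",
}

def normalize_locations(series):
    """Collapse multi-word place names into single tokens so later
    term counts treat each place as one word."""
    for old, new in REPLACEMENTS.items():
        series = series.str.replace(old, new, regex=False)
    return series
```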

Now we can include the city and state names in a larger list that we want to isolate from the rest of the dataset. While including the state acronyms would lead to more results, it would also lead to more falsely identified places — for example, “MT” for Montana could refer to a place with a mountain in its name, “IN” for Indiana and “OR” for Oregon could be stop words used in any location, and “DE” for Delaware, “LA” for Louisiana, and “MI” for Michigan are common Spanish-language words. We want to avoid these situations, and if we don’t include these state acronyms, then we shouldn’t include any other state acronyms either (with DC and NYC being the sole exceptions because of their importance as places). We should also eliminate every mention of “USA” from our locations so the country doesn’t show up in our most common terms.

Filtering Dataframe by State Name
Information about the Dataframe — US States Only
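The filter can be sketched with a regex alternation over the place list; the list here is abbreviated for illustration:

```python
# Abbreviated; the full list names all fifty states plus "dc" and "nyc".
US_PLACES = ["california", "texas", "florida", "newyork", "nyc", "dc"]

def filter_us_rows(df, loc_col="user_location"):
    """Keep only rows whose location field names a US state (or DC/NYC);
    rows with a null location are dropped by na=False."""
    pattern = "|".join(US_PLACES)
    return df[df[loc_col].str.contains(pattern, na=False)].copy()
```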

Now that we’ve limited our locations, we only have about 20,000 rows out of 400,000. But there are more issues to address. All fields must contain data, so we must drop any row with a null value. We should also remove special characters, punctuation marks, stop words, and the rest of those state acronyms from our fields to simplify our search results.

Removing Extra Data from Columns
Simplified Dataframe
Information About the Dataframe — Final Column Count
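The cleaning pass might look like the sketch below; the stop-word list is a tiny stand-in (the real project would use a full list, e.g. NLTK’s), and it assumes every remaining column holds lowercase text:

```python
import re

# Tiny stand-in stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "in", "on", "and", "or", "of", "to"}

def clean_text(text):
    """Replace special characters with spaces, then drop stop words."""
    tokens = re.sub(r"[^a-z0-9\s]", " ", text).split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def clean_frame(df):
    """Drop any row with a null field, then simplify every text column."""
    df = df.dropna().copy()
    for col in df.columns:
        df[col] = df[col].map(clean_text)
    return df
```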

After that, the entire set is a fraction of its former size, but its columns are consistently populated by only the most relevant information. At this point we can move onto our first step of analysis: counting the most common terms.

Graphing the Most Common Terms

Here we import CountVectorizer from Scikit-Learn, and Pyplot from Matplotlib. [7] The CountVectorizer enables us to “tokenize” a series of text documents to create a vocabulary of known words, and “tokenization” in this case is the act of breaking up a string into a series of individual elements (or “tokens”) which can then be parsed independently. [8] Pyplot, on the other hand, lets us plot graphs of data.

Plotting Graphs of the Most Common Terms

First we set the range of terms to find, then we set the type of term (or column) to find from, and then we set the graph color. We create the CountVectorizer object which takes the range as its input, the top_terms object which runs the CountVectorizer on our term type, and the top20 object which sorts the top 20 terms by descending order. Then we plot the graph. We can also change the range to 2 and 3 to see the most common two- or three-word phrases.

Because this Twitter dataset is based on COVID-related tweets, “coronavirus” is the most frequently used hashtag featured in almost 3,500 out of 4,386 tweets. The next top hashtags are “china” and “wuhan,” which reflect the news about COVID’s origins in the earliest days of the pandemic. Almost everything else after that is about what you would expect. Two-term hashtag sets are less common but contain the same ideas, and three-term hashtag sets are even less common.

Top 20 Hashtags (One Term)
Top 20 Hashtags (Two Terms — Left, Three Terms — Right)
Tweet with Common Text Terms “Train”, “Italy”, “Woman”, “Comments Loudly”

Searching for the most common terms by text produces similar results, but “train” and “Italy” are unexpected. Looking up the actual tweet demonstrates how a single, widely-shared post from a single day of Twitter can heavily skew our search results.

Top 20 Text Terms (One Term — Left, Two Terms — Right)

Searching user descriptions produces more general terms, with “love,” “news,” and “health” at the top of the charts. Many of the user descriptions feature right-leaning terms such as “maga,” “trump,” “god,” or “conservative.” This reflects Twitter’s use as a highly political social media platform, particularly for Republican activists in the year of the last presidential election. The most common two-term sets in the user descriptions are a bit more insightful, with New Yorkers, public health enthusiasts, family-oriented individuals and Trump fans showing the highest turnouts (at least among people who tweet about the Coronavirus). Or are these just the most active types of users on Twitter in general? It’s hard to tell.

Top 20 User Description Terms (One Term — Left, Two Terms — Right)
Top 20 User Locations

Lastly we search for the top 20 user locations. Considering that California, Texas, Florida, and New York are the most populous states, it is no surprise that so many Twitter users tweet from these locations. Washington DC is also a hotspot for active social media users — and if New York City was combined with New York, the East Coast state would dominate the list.

Now that we have identified the most commonly used hashtags, text terms, user descriptions and user locations, it’s time to search for similar content in the dataset. First, however, let’s tie our new dataset to our old dataset. By isolating all of the tweets in our old dataset that share the same index numbers as the tweets in our new dataset, we can make them one-to-one for easier comparison. Then we can reset their indices so that they both start from zero, which helps us to create a cleaner map of indices for our similar content search.

Reset Indices of Both Old and New Dataframes
Head of Old Dataframe, Index Reset
Head of New Dataframe, Index Reset
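The alignment step can be sketched as a small helper (the frame names are hypothetical):

```python
def align_frames(df_clean, df_ref):
    """Select the reference rows whose indices survived cleaning, then
    reset both indices so the frames correspond row-for-row from zero."""
    df_ref = df_ref.loc[df_clean.index]
    return (df_clean.reset_index(drop=True),
            df_ref.reset_index(drop=True))
```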

TF-IDF

In our previous searches, we found the most common terms by simply counting them with the CountVectorizer. However, while simply finding the most common terms between two different sets of data might sound like a decent way to search for similar content, a problem arises: as the size of two documents both increase, the number of common words they share tends to increase too, even if they are not talking about the same subjects. So, a basic count of shared words is not enough to find truly similar content.

Visualization of TF-IDF

Because of this problem we use the TfidfVectorizer instead, which makes use of the TF-IDF algorithm. TF stands for “term frequency,” the rate at which a word occurs in a single document. IDF stands for “inverse document frequency,” a measure of how common a word is across a corpus (or body of works), taken by dividing the total number of documents by the number of documents that contain a particular word. When TF and IDF are multiplied together, we get a value that reflects how important a word is to a particular document within a corpus. Here we set our parameter to “hashtags.” Each hashtag is vectorized, a matrix of vectors is created, and the shape of the matrix is revealed. [9]

TF-IDF Vectorizer
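A minimal sketch of that step; the `stop_words` setting is an assumption on my part, since the screenshots don’t show every parameter:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def build_tfidf(docs):
    """Turn each user's hashtag string into a TF-IDF vector; the matrix
    has one row per user and one column per distinct term."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(docs)
    return vectorizer, matrix
```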

Cosine Similarity

While there are multiple means of computing similarity, we can use Cosine Similarity to calculate a numeric similarity score that denotes how close one user is to another in terms of mathematical distance.

Formula for Cosine Similarity
Cosine Similarity VS Euclidean Distance

The math is complex, but it involves projecting vectors into a multi-dimensional space and calculating the cosine of the angle between two different vectors. The smaller the angle, the higher the similarity. While it may be hard to grasp on its own, it makes more sense when we visualize it on a graph and compare it to another metric. Here is a representation of Cosine Similarity VS Euclidean Distance. [10]

Formula for Euclidean Distance
Angles VS Distance

Cosine Similarity is said to be advantageous over Euclidean Distance as a measure of similarity because it ignores the magnitude of the vectors and focuses purely on the angle between them.

An example of this would be measuring the similarity of three different shopping baskets:

  • Basket A: 1 carton of eggs, 1 jug of milk, 1 loaf of bread
  • Basket B: 100 cartons of eggs, 100 jugs of milk, 100 loaves of bread
  • Basket C: 1 carton of eggs, 1 bottle of wine, 1 bottle of beer

Euclidean Distance would rate Baskets A and C as similar, whereas Cosine Similarity would rate Baskets A and B as similar.
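Treating each of the three examples above as a count vector makes the contrast concrete (the vector layout is my own illustration):

```python
import numpy as np

# Count vectors over [eggs, milk, bread, wine, beer]
a = np.array([1, 1, 1, 0, 0])
b = np.array([100, 100, 100, 0, 0])
c = np.array([1, 0, 0, 1, 1])

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Cosine ignores magnitude: A and B point in exactly the same direction.
print(round(cosine(a, b), 3))   # 1.0
print(round(cosine(a, c), 3))   # 0.333
# Euclidean distance says the opposite: A sits far from B but near C.
print(round(np.linalg.norm(a - b), 1))  # 171.5
print(round(np.linalg.norm(a - c), 1))  # 2.0
```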

Anyway, we can use TF-IDF, Cosine Similarity, and the linear kernel (a simple dot product, which equals cosine similarity on L2-normalized TF-IDF vectors) together to obtain these scores.

Cosine Similarity
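The score matrix can be computed in two lines; this sketch wraps them in a helper for clarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def similarity_scores(docs):
    """TF-IDF vectors come out L2-normalized, so the linear kernel of
    the matrix with itself gives the cosine similarity of every pair
    of users in one step."""
    matrix = TfidfVectorizer().fit_transform(docs)
    return linear_kernel(matrix, matrix)
```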

Searching for Similar Users

The last thing we do before creating the search is make a reverse map of indices to find the index of a user by entering their user name into the field.

Reverse Map of Indices
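The reverse map is one line of pandas; the column name is an assumption:

```python
import pandas as pd

def build_reverse_map(df, name_col="screen_name"):
    """Map each screen name back to its row index so a user can be
    looked up by name instead of position."""
    return pd.Series(df.index, index=df[name_col]).drop_duplicates()
```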

At long last, we can create our program to find similar users in the dataset.

Find Similar Users
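The search follows the standard recommender-tutorial pattern; this is a sketch of that shape, not the article’s exact code:

```python
def find_similar_users(name, indices, cosine_sim, df, top_n=10):
    """Rank every user by similarity to the named user and return the
    closest top_n rows, skipping the user themselves (always rank 0)."""
    idx = indices[name]
    scores = sorted(enumerate(cosine_sim[idx]),
                    key=lambda pair: pair[1], reverse=True)
    top = [i for i, _ in scores[1:top_n + 1]]
    return df.iloc[top]
```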

Since one of the most commonly used hashtags was “China,” let’s find the user with the longest set of hashtags containing the name of the country so that we have a large set of terms to compare against. We can also display it next to the original dataset to understand what the user was saying at the time. Then we can find users whose use of hashtags is similar to this one.

Finding User with Longest Series of Hashtags Featuring “China”
Full Tweet of User #3099 — with Longest Series of Hashtags Featuring “China”
Finding Similar Users to User #3099

Interestingly enough, while “China” was the initial hashtag of interest here, most of the users in this selection are more concerned about health than anything else, since so many of the other hashtags in the original tweet (#Health, #Healthcare, #MHealth, #DigitalHealth) are health-related. And because the whole dataset is built from Coronavirus tweets, it’s no surprise that many users would have some mention of health. Hashtags in general are also sparsely used compared to actual tweet text — because most users only include two to four hashtags in each tweet, searching for users based on hashtag use is probably not enough to truly measure similarity. Let’s try searching text terms instead, using similar methods as above, this time with the third most common term, “Wuhan.”

Finding User with Longest Series of Tweet Texts Featuring “Wuhan”
Full Tweet of User #574 — with Longest Series of Tweet Texts Featuring “Wuhan”
Finding Similar Users to User #574

In this case, there is probably too much tweet text to base a similarity search on — the user posted about not only Wuhan but a myriad of other subjects such as Islam, Brexit, a Dutch politician, Jeff Bezos, and different mentions of Saturday. As a result, the search produces a similar assortment of dissimilar posts. Maybe searching by user description could lead to some more interesting results. And since “Trump” was such a common political term in so many user descriptions, let’s find the user with the longest user description containing the former president’s name.

Finding User with Longest User Description Featuring “Trump”
Full Tweet of User #1868 — with Longest User Description Featuring “Trump”
Finding Similar Users to User #1868

This proves to be more useful in finding users that are likely to share similarities with each other, as the things that people say about themselves in their user descriptions are often much more personal than the things that they say to the public (or to each other). However, one intriguing twist here is that there are almost as many anti-Trump users as there are pro-Trump users, along with those who are “pro-” or “anti-” other things — for example, while one user is “pro-gun,” another user is “anti-gun.” One thing this similarity search can’t do is sentiment analysis. Even overt references to political concepts can’t easily reveal a person’s actual political beliefs without considering the context in which those concepts are used. It should be noted, though, that the top pro-Trump users are from Florida, Alaska, Michigan, and Florida again — all states that voted for Trump in the 2016 presidential election.

The fourth search we can do is for user location. Let’s try Texas, which is both the second largest state by population and a big bastion for Trump voters.

Finding User with Longest User Location Featuring “Texas”
Full Tweet of User #2137 — with Longest User Location Featuring “Texas”
Finding Similar Users to User #2137

While almost everyone is from Dallas and tweeting about the Coronavirus, half of the users mention something about China, and their user descriptions are fairly diverse. It’s hard to say whether these users are truly similar based on this information alone.

For our fifth and final search, let’s do something completely new — combining the hashtags, text terms, user descriptions and user locations of each user into a single “soup” of content and then comparing those soups to each other. Our user of choice is also random.

Creating a “Soup” of Hashtags, Text, User Descriptions and User Locations
Finding Random User
Full Tweet of Random User
Finding Similar Users to Random User
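The soup step itself is a simple row-wise join; the column names are assumptions about the cleaned frame:

```python
def make_soup(df):
    """Join each user's hashtags, tweet text, description and location
    into a single string per row for one combined TF-IDF comparison."""
    cols = ["hashtags", "text", "user_description", "user_location"]
    return df[cols].agg(" ".join, axis=1)
```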

The more metrics we add to the search (hashtags, tweet text, user descriptions, user locations), the more random the results actually seem to be. While we can see some mentions of drones, dads, law, and IoT (the Internet of Things), we have no way of measuring which words in the soup of text are more important than others (and none of the user locations are similar to the original, either).

Limitations

The most obvious limitation is that this dataset is completely focused on tweets about the Coronavirus, which restricts the scope of the data. It also features only a single day of tweets early in the pandemic’s spread, and each user is represented by just one tweet. It’s not a large dataset, and it became even smaller after eliminating rows containing null values — many users don’t use hashtags or user locations. On top of that, searching for locations based solely on the full name of the state and excluding the acronym shrinks the data even further. This dataset is not a comprehensive reflection of all Twitter users, or even of the users that ARE in it. The study does not analyze user statistics such as total tweets, followers, friends, shares, or comments, either. And while the best results came from searching by user descriptions, without performing sentiment analysis on them it’s hard to draw many meaningful insights — two people can both mention the exact same things, but if they hold completely separate opinions about those things, are the users truly similar?

Conclusion

The vast majority of tutorials on finding similar content online are based on recommender systems that filter content such as movies or TV shows, or perform collaborative filtering on the unique users who enjoy those movies or TV shows. Basing this political similarity search on Twitter users therefore felt like the best option for the project, because it was the closest equivalent to the approach used in those other examples.

However, it might be more useful to base future social media similarity searches on user locations rather than on users themselves. Finding similar locations and measuring the most common terms used there could do a better job of helping us understand what, in general, the users living in those locations are like, what political topics they tend to talk about, and whether a majority of those users hold particular opinions about COVID. For those who are interested in peoples’ political beliefs, regional targeting could be just as useful as personal targeting.

In order to do this type of search, though, the structure of the dataframe would have to go through some fundamental alterations. This location-based dataframe could not have any duplicate user locations (unlike the dataframe in this project), and each user location would have to contain the totality of the information relating to it — in other words, every column of data for each location would have to merge all of its respective hashtags, tweet texts, user descriptions, and user names into a single soup of words in order to be analyzed. Only then could the user locations be properly compared to each other to find similarities between them. At least, that’s what I imagine it would take to do such a similarity search. But that is a project for another time.

References

[1] https://theconversation.com/video-how-did-mask-wearing-become-so-politicized-144268

[2] https://www.washingtonpost.com/politics/2020/04/25/senate-gop-talking-points-coronavirus-blame-china-not-trump/

[3] https://www.nytimes.com/2020/04/17/us/politics/poll-watch-quarantine-protesters.html

[4] https://towardsdatascience.com/introduction-to-recommender-systems-1-971bd274f421

[5] https://crisisnlp.qcri.org/covid19

[6] https://github.com/DocNow/hydrator

[7] https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/

[8] https://www.techopedia.com/definition/13698/tokenization

[9] https://monkeylearn.com/blog/what-is-tf-idf/

[10] https://www.machinelearningplus.com/nlp/cosine-similarity/
