Data Science Report

Can We Use Twitter to Track COVID-caused Unemployment in the USA?

A Comparative Analysis of 3 Separate Datasets

Doug Rizio · Published in Analytics Vidhya · May 13, 2021

By Doug Rizio, Trishala Suryavanshi, Mohammed Yahya, and Varun Garg

COVID-19, the novel Coronavirus. First detected in late December of 2019 as a viral outbreak in Wuhan, China, this mysterious new disease with pneumonia-like symptoms quickly spread throughout the rest of Mainland China and ultimately infected every major population center around the world. As of May 12th, 2021, 159,319,384 confirmed cases of COVID-19 had been reported to the WHO worldwide, including 3,311,780 deaths. In the United States, the country with the greatest number of total COVID cases at 32,424,637, an estimated 576,814 people have died so far.

On a global scale, the Coronavirus pandemic has left behind a legacy of international air travel restrictions, national lockdowns, the closure of businesses, and the deaths of millions. On a personal scale, it has torn friends and family members apart, ended the lives of our loved ones, and impacted much of life itself — changing everything from how we spend our time at home to how we work at our day-to-day jobs.

In fact, for many people, the two formerly separate facets of home life and work life have now completely blended together in the wake of COVID-19. As people from nearly every conceivable industry have been pushed into the practice of social distancing, personal quarantines, and work-from-home measures, the nature of work itself has been completely altered by the Coronavirus — perhaps even permanently. Yet those who were able to hold onto their jobs, even in this suspended state of isolation, were the fortunate ones. In the earliest days of the pandemic, only the most essential workers were allowed to continue working, and enormous numbers of workers in nonessential or socially-oriented occupations immediately found themselves un- or under-employed as government-mandated lockdowns and the fear of infection decimated the demand for their services.

COVID hit the United States job market particularly hard. The lack of effective financial relief from the federal government ensured that countless businesses and independent contractors suffering from the effects of the Coronavirus would continue to struggle as long as the virus persisted. Unprecedented numbers of American companies reduced their operations or shut down entirely, with roughly 500 US companies filing for bankruptcy in 2020. The unemployment rate in America skyrocketed as a result of these closures, at one point reaching as high as 16% — amounting to 20.5 million people in the country left without a job.

Now, with large numbers of people out of work, stuck at home, and frustrated about the state of the world, the internet has seen a tremendous surge in activity. Social media use in particular rose significantly post-Coronavirus, with some estimates in the USA measuring a 32% increase in March 2020 alone. Many people flocked to websites like YouTube, Instagram, Facebook, and Twitter to talk to each other in lieu of face-to-face contact, and these social media platforms became essential tools for all manner of expression about our troubled times — acting as vital portals for public conversations ranging from political debates to personal COVID updates to complaints about work (or the lack thereof) during the pandemic.

Even before the Coronavirus, social media platforms were starting to become valuable resources for data analytics on current events and popular discussions circulating around the globe. In the post-COVID world, the volume of publicly available data circulating through these systems is greater still. Now, considering the vast amount of knowledge we increasingly have at our fingertips, an interesting question arises — how exactly can we put the information produced by this new technology to use?

KEY IDEAS

The key ideas or main goals of this project are to compare multiple sources of data on COVID cases, rates of unemployment, and Twitter mentions of both subjects from specific regions of the United States, in order to answer the following three questions:

  1. Can we use social media data to track the rates of COVID-19 and unemployment as they are happening?
  2. If tweets about these issues actually correspond to rates in the real world, could we use this data to predict the rates of unemployment or COVID cases and deaths in the future?
  3. And, most importantly, can we use this data to help the victims of COVID-caused unemployment? If the American government has limited funds to provide assistance to people who are out of work as a result of the pandemic, can we analyze the intersection of Coronavirus cases, unemployment rates, and social media mentions of both subjects to evaluate which regions of the country felt the most impact? Can we inform the government where their scarce resources would best be allocated, and offset the effects of COVID-19 on job loss for those that are suffering the most?

IMPORTING LIBRARIES

As the bulk of our project was done in the development environment known as Jupyter Notebooks, the primary programming language we worked with was Python, and we had to import numerous Python libraries into the notebook before downloading any data. Here is a list of the libraries that we used, and what their functions are:

  • Pandas: for converting raw data into manipulable dataframes
  • JSON: for working with JavaScript Object Notation files (used in Twitter data)
  • PyDrive and Google: for working with Google tools
  • NLTK: Natural Language Toolkit, for text processing
  • Numpy: for working with arrays and performing calculations
  • Plotly, Matplotlib, and Seaborn: for visualizing data
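Put together, the import cell looked roughly like the following sketch (conventional aliases assumed; the exact cells in our notebook were spread out and differed slightly):

```python
# Core data handling
import json

import numpy as np
import pandas as pd

# Text processing
import nltk

# Visualization
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

# Google Drive access (PyDrive, used alongside Colab's google module)
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
```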

Because our project also encompasses a few different subjects (COVID-19 cases, unemployment rates, and social media discourse about both), we had to obtain data from several separate sources.

CDC DATA

Data about COVID-19 cases over the past year was taken from the website of the Centers for Disease Control and Prevention, the national public health agency of the United States. The CDC had a large dataset of COVID-19 cases and deaths on its website that was accessible through a JSON API. This dataset reports on all 50 US states and 10 additional cities or territories (like New York City, Washington DC, Puerto Rico, and the US Virgin Islands) for every single day since January 22, 2020.

The dataset from the CDC features roughly 24,000 rows of COVID statistics with 15 columns labeled by items such as date, state name, total cases over time, total deaths over time, new cases per day, and new deaths per day. Some of these columns contained extraneous or incomplete information, such as a second date column and "probable" vs. "confirmed" cases and deaths. Because many of the fields within these columns had null values, they were not useful for analysis and needed to be deleted to properly plot the data. Once the dataset was extracted from the CDC through the API, it was converted into a Pandas dataframe, cleaned up and organized, and then saved as a .CSV file for longer-term storage.

CDC Data on COVID rates
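A minimal sketch of that extraction step, assuming a Socrata-style JSON endpoint (the resource ID and column names shown here are illustrative and may differ from the current API):

```python
import pandas as pd
import requests

# Hypothetical Socrata endpoint for the CDC "Cases and Deaths by State
# over Time" dataset; the exact resource ID may differ
URL = "https://data.cdc.gov/resource/9mfq-cb36.json"

# The API pages its results, so request enough rows in one call
resp = requests.get(URL, params={"$limit": 60000})
resp.raise_for_status()
df = pd.DataFrame(resp.json())

# Keep only the columns used in the analysis; drop the sparse
# "probable"/"confirmed" breakdowns and any rows with nulls
cols = ["submission_date", "state", "tot_cases", "tot_death",
        "new_case", "new_death"]
df = df[cols].dropna()
df["submission_date"] = pd.to_datetime(df["submission_date"])
df[cols[2:]] = df[cols[2:]].apply(pd.to_numeric)

# Persist the cleaned frame for longer-term storage
df.to_csv("cdc_covid_data.csv", index=False)
```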

BLS DATA

Our second source of data was unemployment statistics from the Bureau of Labor Statistics, or the BLS. While we could only gather COVID data from 2020 due to the disease's spread during that year, we decided to collect US unemployment statistics from 2018 through the first few months of 2021 to get a broader perspective on how employment in the country has changed over time. If 2018 and 2019 were relatively "normal" years for unemployment, we could better gauge how dramatically COVID impacted unemployment rates in 2020 by looking at the differences between the three years.

Not all of the datasets on the BLS website are available through an API, however, so we had to download this information directly from an HTML page through a Python function. This dataset also originally contained 10 years of seasonally adjusted state unemployment rates rather than the three years (and one month) that we needed, so we had to convert the data once again into a Pandas dataframe, this time dropping 7 years of unnecessary data, cleaning and organizing it, and converting it into a .CSV. Data cleaning for this set also included the use of the melt function to unpivot the dataframe from a wide format to a long format for the sake of easier visualization.

BLS Data on Unemployment Rates
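The melt step is worth illustrating. Here is a toy sketch with made-up numbers showing how melt() unpivots the wide table into one row per state-month pair:

```python
import pandas as pd

# Illustrative wide-format frame of the shape scraped from the BLS page:
# one row per state, one column per month (values here are made up)
wide = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Feb 2020": [2.7, 5.9],
    "Apr 2020": [13.8, 13.5],
})

# melt() unpivots wide to long: one (state, month, rate) row per cell,
# which is far easier to feed into seaborn or plotly
long_df = wide.melt(id_vars="State", var_name="Month",
                    value_name="Unemployment Rate")
print(long_df)
```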

TWITTER DATA

The most challenging data to collect was data from Twitter. Our initial goal was to search through Twitter's full archive of tweets using the Python library Tweepy to collect all tweets from the past three years that mentioned keywords related to unemployment and diseases such as COVID, were tagged with specific places, and were posted by users from the United States. To access this immense archive of social media posts, all of our group members first needed to sign up for Twitter developer accounts and have them approved by the company. However, we quickly realized that the Twitter API imposes strict rate limits on the number of tweets that a user can request within a certain window of time, and we weren't able to perform the searches as we intended.

Thankfully, there are other methods of obtaining large amounts of tweets, particularly when it comes to COVID-19. CrisisNLP is a website devoted to research on crisis information topics such as the Coronavirus, and it features a wealth of data on Twitter mentions of the disease — its GeoCoV19 dataset contains hundreds of millions of COVID-related tweets based on roughly 800 different hashtags linked to COVID-19. However, the data on this website came with its own set of limitations, as the tweets only ranged from February 1st to May 1st of 2020. Additionally, because Twitter's terms of service do not allow full datasets of tweets to be distributed to third parties such as CrisisNLP, all tweets on the website are "dehydrated", meaning that they are stored in plain text files as unique tweet IDs that act as references to the tweets in the Twitter archive rather than the tweets themselves. A desktop application called a "hydrator" was required to "rehydrate" these tweets from Twitter into full JSON data. Through the hydrator, these tweet IDs allowed us to retrieve all tweet metadata, including the text of the tweet.

The Hydrator Program

Although the dataset only featured 3 months of tweets, collecting all of it was surprisingly time-consuming. First, the dehydrated tweets were separated into .TSV files, each representing all COVID-related tweets for a specific day, and each file contained hundreds of thousands to several million tweet IDs depending on the day. After downloading all of these files individually, a little extra processing was required to remove unnecessary headers and columns and to convert each file into a .CSV readable by the hydrator. And while the hydrator is its own entity, it still requires access to the Twitter API through a user's developer account and must also handle the API rate limits that force it to pause every 15 minutes. Hydration time also dramatically increased for every successive file that we input into the program — while the file for February 1st only had around 650,000 tweets and took around 2 hours to hydrate, the files for March and April had 1–6 million tweets each and took entire days to finish processing.
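That preprocessing step amounts to something like the following sketch (filenames are illustrative; the Hydrator accepts a headerless, one-ID-per-line CSV):

```python
import pandas as pd

# One day of dehydrated GeoCoV19 IDs in, a headerless one-column CSV out
day = pd.read_csv("feb_01.tsv", sep="\t")  # hypothetical filename

# Keep only the tweet ID column (dropping headers and extra columns)
# and write the IDs one per line with no header or index
day.iloc[:, 0].to_csv("feb_01_ids.csv", index=False, header=False)
```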

Additionally, many of the tweets that the hydrator attempts to extract were deleted by their users after being archived by CrisisNLP, and the February 1st dataset with 650,000 tweet IDs only yielded around 400,000 tweets. Adding to the endless challenges of working with Twitter data, the JSON files created through hydration are massive, ranging anywhere from 10 to 30 or more gigabytes and often causing our computers to run out of space. The large size of this correlated data is another reason why the datasets only offer dehydrated tweet IDs instead of the real thing — a plain text file containing a series of ID numbers is much more manageable to upload to and download from a website than a JSON file with millions of tweets and all of their metadata. Likewise, the only way for us to deal with the huge sizes of the JSON files was to convert each one into a much smaller .CSV and then delete it before hydrating the next day of tweets.
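The JSON-to-CSV reduction looked roughly like this sketch. The Hydrator writes newline-delimited JSON, so the file can be streamed line by line instead of loaded whole (filenames and the kept fields are illustrative):

```python
import json

import pandas as pd

# Stream the multi-gigabyte hydrated file one tweet object per line,
# keeping only the fields the analysis actually needs
rows = []
with open("feb_01_hydrated.jsonl") as f:  # hypothetical filename
    for line in f:
        tweet = json.loads(line)
        rows.append({
            "id": tweet["id"],
            "created_at": tweet["created_at"],
            "text": tweet.get("full_text") or tweet.get("text", ""),
            "user_location": tweet["user"].get("location"),
        })

pd.DataFrame(rows).to_csv("feb_01_tweets.csv", index=False)
```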

CrisisNLP “GeoCoV19” Tweets
Tweet Files in Google Drive

Ultimately, we only ended up with every other day of tweets due to limitations in time and storage space. Because of the enormous size of the data, we needed to use Google Colab to enable all of our teammates to access it. The sequence of images shown below details the folders featuring files from February and a few lines of code used to iterate through the files.

Importing Libraries
Accessing Google Files
Iterating Through Whole Folder
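In place of those screenshots, here is a condensed, hypothetical sketch of what such cells do (the folder path is illustrative):

```python
import os

import pandas as pd
from google.colab import drive

# Mount the shared Drive so every teammate sees the same files
drive.mount("/content/drive")

# Hypothetical shared folder of per-day tweet CSVs for February
folder = "/content/drive/MyDrive/covid_tweets/february"

# Read every CSV in the folder and stack them into a single frame
frames = [pd.read_csv(os.path.join(folder, name))
          for name in sorted(os.listdir(folder))
          if name.endswith(".csv")]
tweets = pd.concat(frames, ignore_index=True)
```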

DATA CLEANING AND ORGANIZATION

Once we had downloaded and converted our various sets of data into dataframes, we had to clean and organize them. This is a technique of data science known as dimensionality reduction, which can be defined as simplifying the dataset and limiting its information to only what is most relevant to our use. The dimensionality reduction we performed in this case involved limiting the data to a specific timeframe and isolating only the locations of interest, such as US states.

Reducing the dimensions of the COVID data from the CDC and the unemployment statistics from the BLS was a simple task, as both resources consisted of well-organized numerical information gathered by major government institutions. The only features that required removal in these datasets were locations that didn't qualify as US states (such as Puerto Rico) and extra columns containing null values or unnecessary information.

Because the data in these two sets is organized by the state in which it was recorded, we needed to reduce the Twitter data from CrisisNLP by state too. However, working with tweets proved to be more challenging. One complication we ran into was that Twitter has several location parameters available for each tweet. The first is a "coordinates" parameter that contains the exact GPS coordinates of the person who posted the tweet, such as 40.741895,-73.989308. The second is a "place" parameter that contains the name of a specific place that the user has checked into upon posting, such as New York, New York. The third parameter is not associated with a particular tweet but is a "user location" that each Twitter user has the option of featuring in their profile. Out of the 412,239 COVID-related tweets extracted from February 1st of 2020, only 117 rows of data contained coordinate information and 2,991 rows featured place data, while 281,412 had some kind of user location. All 117 tweets that had coordinate data also displayed information for the place parameter, and most tweets that had place parameters also featured user locations. As a result, we felt that both the coordinate and place parameters were unnecessary for analysis, and we chose to use only the user location for identifying users from particular regions.

Although some of the user locations listed in a person's profile are official locations offered by Twitter itself, many are actually written in directly by the user. And because users can write in their own personal locations, many of them contain not only the names or acronyms of specific US states, but also extra data such as emojis, stop words, or the names of other countries, which we had to eliminate from the dataset.

Defining Names of US States

And while that was easy enough, the next thing we had to do was limit the user locations to only US states. Our first thought was to simply search for all tweets whose user locations contained the string "USA" to isolate tweets from America, but many locations only feature the city and the state name, and not the country code, with names like "Montgomery, Alabama." Many users also do not use official Twitter locations in their profiles, opting for bare state names like "Alabama." Some users feature names like "Montgomery, AL," with acronyms instead of full state names, and many of these names exist in both capital and lowercase form. Because of this, we needed to search for each individual state by both full name and acronym to maximize our location results.

Combining Names of US States
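A condensed sketch of those definitions, truncated to a few states (the real dictionary covers all 50):

```python
import re

# Full-name-to-abbreviation mapping (truncated here for brevity)
states = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    # ... 46 more entries ...
    "Wyoming": "WY",
}

# One regex per state, matching the full name or the abbreviation as a
# standalone word, in either case. As discussed below, this deliberately
# casts a wide net and over-matches some locations.
patterns = {
    name: re.compile(rf"\b(?:{name}|{abbr})\b", re.IGNORECASE)
    for name, abbr in states.items()
}
```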

However, this is also not as simple as it sounds, as many of the state acronyms exist in the names of other regions. And while making a list of all foreign countries and removing them from the US state database works for some locations, the names of certain countries are contained within, or identical to, the names of certain states: "India" in "Indiana," "Mexico" in "New Mexico," and "Georgia" itself. These can't be filtered out as easily. Additionally, a certain number of users list multiple places as their user location, indicating that they split their time between them. Likewise, other users include user locations with travel emojis that imply permanent relocation from one place to the next, and some include flag or globe emojis for similar purposes. On top of these actual location indicators, a few users put down phrases with stop words or general terms. Searching for the most common stop words and deleting them from the database could have been a solution to this issue, but many short stop words also exist in the names of real places. Searching for certain state acronyms like LA, OR, and AR likewise returns vague locations that needed to be filtered out. Ultimately, our searches ended up casting a wider net than we intended no matter how hard we tried to filter out extra data, and some of the extra cleaning we did involved the manual identification of improper locations.

Searching for US States in Tweet User Locations, for Every File in each Folder

After sorting the tweets by location, we also wanted to subdivide them even further: first into the full list of tweets, and second into a list of tweets containing mentions of unemployment-related terms such as "unemployment," "work," "jobs," etc. It should be reiterated that the entire Twitter dataset contained tweets with mentions of COVID-19, so any subset of the main set will be related to COVID too. However, with this subsection we could still attempt to understand how much of the public discourse about the Coronavirus consisted of conversations about unemployment as well. Once this goal was attained, we could begin counting the number of total COVID tweets vs. the number of unemployment-related tweets for future analysis.

Printing the Count of Tweets per State, per Day
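Roughly, the matching, filtering, and counting steps amount to the following sketch, building on the frames above (the keyword list and field names are illustrative, not our exact ones):

```python
import pandas as pd

# Assumes the `tweets` frame and `patterns` dict from the earlier sketches
def match_state(location):
    """Return the first state whose name or abbreviation appears in a
    free-text user location, or None. The real pipeline added further
    filtering for foreign countries, emojis, and ambiguous acronyms."""
    if not isinstance(location, str):
        return None
    for name, pattern in patterns.items():
        if pattern.search(location):
            return name
    return None

tweets["state"] = tweets["user_location"].map(match_state)
by_state = tweets.dropna(subset=["state"])

# Subset of COVID tweets that also mention unemployment-related terms
keywords = "unemployment|unemployed|laid off|layoff|jobless|lost my job"
unemp = by_state[by_state["text"].str.contains(keywords, case=False, na=False)]

# Count tweets per state, per day
day = pd.to_datetime(by_state["created_at"]).dt.date
print(by_state.groupby(["state", day]).size())
```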

VISUALIZING AND COMPARING DATA

After we cleaned and organized our data, we could start visualizing and comparing the sets.

Unemployment Rates for USA, Per Month (2018–2021)

The first one we graphed was our simplest set, the unemployment data from the BLS. Visualized below, we can see the total unemployment rate of the entire USA for all three years from 2018 to 2020, plus the first month of 2021. While unemployment rates from January 2018 to February 2020 were on a marginal decline, March 2020 saw a slight uptick compared to the previous year, and in April 2020 the rate skyrocketed as the Coronavirus made its impact on the USA. Fortunately, unemployment rates have been falling ever since, and by January 2021 they were much closer to their original levels.
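A minimal sketch of such a plot, reusing the long-format frame from earlier; note that an unweighted mean of state rates is only a rough proxy for the true national figure:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Assumes the long-format BLS frame `long_df` built earlier, with
# "State", "Month", and "Unemployment Rate" columns
usa = (long_df.groupby("Month", sort=False)["Unemployment Rate"]
              .mean()
              .reset_index())

plt.figure(figsize=(12, 4))
sns.lineplot(data=usa, x="Month", y="Unemployment Rate")
plt.xticks(rotation=45)
plt.title("US Unemployment Rate per Month (2018-2021)")
plt.tight_layout()
plt.show()
```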

Average Unemployment Rate per State (2018–2020)

This next graph shows the average unemployment rates for all states across the three years of 2018, 2019, and 2020. We decided to include every US location listed in the dataset as a more interesting basis of comparison — and while Nevada and Alaska featured the highest rates of unemployment among the states, the region with the highest overall average unemployment rate throughout the past few years was Puerto Rico, still recovering from the devastation wreaked by Hurricane Maria in September 2017. Interestingly enough, however, Puerto Rico's average unemployment rate of 10% for the past three years is outpaced by some states' rates in the subsequent graph for 2020.

Average Unemployment Rate per State (2020)

Nevada tops the charts for unemployment once again at over 13%, whereas Hawaii is almost as bad at 12%, and California comes in at just over 10%. It's hard to draw a solid set of conclusions based on this data alone, but one possible hypothesis is that states whose economies are primarily based on tourism or entertainment would be most affected by lockdowns, causing many businesses to lose profits and many people to end up without work.

Setting Up Plots

Next, shown above is how we imported the Python libraries matplotlib and seaborn and then created a plot comparing the total number of COVID-related tweets to the subset of COVID-related tweets featuring mentions of unemployment from February to early May. The number of tweets rises as time goes on, and while our dataset for May was incomplete, we can see that it likely continued to rise throughout the month. The large number of tweets in Georgia is a possible error due to another country having the same name, and the same could probably be said of the numbers in Indiana, whose acronym "IN" appears as an ordinary word in many user locations. What isn't surprising is that New York, California, and Texas, some of the most populous states in America, also contain the greatest number of tweets.

Total Number of COVID Tweets per State (Feb — May)
Number of COVID Tweets Related to Unemployment per State (Feb — May)

Another breakdown of the differences between our tweet totals is shown in these two bar graphs. While all COVID tweets number in the low millions, the subset of COVID tweets mentioning unemployment-related terms numbers around 40,000. We also get a clearer picture of the rise in all COVID-related tweets over time, steadily increasing each month as COVID-19 continued to spread. And while the number of unemployment-related tweets in March was much higher than expected given the rate at which all COVID-related tweets were being posted, the month with the highest rate of unemployment according to the BLS was April. Could it be that peoples' fears of unemployment rose to the surface on social media just as the Coronavirus was starting to spread, before actual unemployment rates started to rise?

Total COVID Tweets (left), COVID Tweets Related to Unemployment (right)

The pie chart below is another breakdown of tweets, showing the number of tweets mentioning unemployment at 4.4% of all COVID-related tweets in the USA. At just short of 5%, this means that roughly 1 out of every 23 COVID-related tweets in the USA mentions employment. While not enormous, it is significant, and it reflects the pandemic's tangible impact on jobs and the online conversations about unemployment that followed.

Total COVID Tweets VS COVID Tweets Related to Unemployment

Taking these dataset comparisons a step further, we compared all COVID-related tweets in the US to the total number of COVID cases in the top 5 states, selected by either the highest number of tweets or the highest case numbers depending on the month. California has the most COVID tweets in February despite a relative absence of COVID cases. By March, however, the social media mentions in California are on the lower end of the spectrum, while the total number of cases in New York has far outpaced the social media mentions of the disease. By April, COVID tweets begin to rise along with the total number of COVID cases — a trend that would likely continue into May if we had access to all tweets for that month.

Total COVID Cases and Deaths VS Total COVID Tweets per State
Total COVID Cases VS Total COVID Tweets

And here is a pie chart showing the volume of all COVID tweets vs. total COVID cases. It's surprising that the numbers are roughly equal — that means that for every person in the US who had COVID, there was roughly one tweet about COVID as well. However, it's important to note that conversations about the Coronavirus on Twitter began well before the virus actually started infecting people in the US. If we had access to more months of COVID case data, it's very likely that these numbers would skew further toward case counts.

While we weren't sure how helpful it would be in extracting more insights from our data, we also wanted to try implementing K-Means clustering as a way to explore newly acquired data science techniques. In this case, clustering was done on a combination of total COVID cases, total tweets, and unemployment tweets. Each state was clustered into a group defined by similar numbers of cases and tweets, and the algorithm relies on Euclidean distance as its similarity metric. Below, we used the Elbow method to help find the best number of clusters to apply. According to the left plot, which bends at k = 4, the data is best distributed among four clusters. The graph on the right shows the four clusters with their respective case and tweet counts.
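A sketch of the clustering and Elbow computation with scikit-learn, using random filler in place of our actual per-state totals:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-state totals (random filler standing in for the
# real case, tweet, and unemployment-tweet counts)
rng = np.random.default_rng(0)
state_stats = pd.DataFrame({
    "total_cases": rng.integers(10_000, 400_000, 50),
    "total_tweets": rng.integers(5_000, 300_000, 50),
    "unemp_tweets": rng.integers(100, 5_000, 50),
})

# Standardize so no single feature dominates the Euclidean distances
X = StandardScaler().fit_transform(state_stats)

# Elbow method: inertia (within-cluster sum of squares) for k = 1..10
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 11)]
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia")
plt.show()

# The curve bends at k = 4, so fit four clusters and label each state
state_stats["cluster"] = KMeans(n_clusters=4, n_init=10,
                                random_state=0).fit_predict(X)
```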

The Elbow Method (left), Cluster Centroids (right)
Cluster Analysis for Total COVID Cases and Total COVID Tweets

The plot here shows the actual distribution of clusters. Although New York's cluster is the smallest, New York is actually the state with the largest number of tweets and cases, which is why it stands on its own. Meanwhile, the largest cluster features the many other states that had far fewer COVID cases and tweets by comparison. It should be mentioned, however, that K-Means centroids are highly sensitive to outlier points, and if the algorithm is unable to detect anomalies in the data, a state might be placed into the wrong cluster as a result. It is questionable how useful K-Means clustering ultimately was for this project, but it was worth trying out.

Cluster Pie Chart (left), Clustered States (right)

A final experiment in this project was to see whether we could use machine learning to come up with predictions for unemployment rates in the future. The training function we used was trainlm, MATLAB's network training function that updates weight and bias values according to Levenberg-Marquardt optimization.
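Since trainlm belongs to MATLAB's toolbox, here is only a rough Python analogue of the setup: scikit-learn offers no Levenberg-Marquardt optimizer, so this sketch mirrors the three-hidden-layer structure rather than the training algorithm, and it runs on random filler data standing in for our features and targets:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random filler: one row per state, three features (cases, COVID tweets,
# unemployment tweets), with next-month unemployment rate as the target
rng = np.random.default_rng(0)
X = rng.random((50, 3))
y = rng.random(50) * 15  # hypothetical unemployment rates (%)

# Three hidden layers, echoing the structure of our final model
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16, 8, 4), max_iter=5000,
                 random_state=0),
)
model.fit(X, y)
predicted = model.predict(X)  # in practice: features for the next month
```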

Based on the outcomes of the neural network predictions, we see that almost 35 states fall within a ±10% error range, or roughly 70% of all states. The prediction is based on a neural network with 3 layers. This is a noticeable improvement over the model in our interim report (with 2 layers), where 38 states only fell within a wider ±15% range.

As an example to verify our findings, we picked California, one of the states hit hardest by COVID in terms of both actual case counts and rates of unemployment. The number of unemployment-related tweets there grew from 1,020 in February 2020 to 3,741 in March and 4,049 in April 2020. The unemployment rates as reported by the BLS were 4.3%, 4.5%, and 16% for those months, respectively. We found around 3,897 unemployment tweets from California in May 2020, slightly fewer than in April, so we expected the actual BLS unemployment rate to drop. Through our machine learning model, we could see that the predicted rate also fell, to 15.4%, supporting our model's accuracy.

That being said, our data only covered a range of a few months, which is probably not enough to produce long term predictions about statistics such as statewide unemployment rates, and we were only able to project one month into the data’s future, which was already an entire year ago. In other words, we weren’t able to make many new predictions about anything.

Machine Learning Processes

LIMITATIONS

Unfortunately, this project experienced many limitations — partly due to the unforeseen nature of certain datasets, and also due to the sheer ambition of the project goals!

Firstly, our original intention was to compare our dataset of CrisisNLP’s GeoCoV19 tweets featuring mentions of the Coronavirus to tweets featuring mentions of unemployment that we scraped from the Twitter archive ourselves. However, due to the severe restrictions of Twitter’s API on the collection of older tweets, we were forced to abandon that idea and take a subset of unemployment tweets from the COVID tweets instead. This means that we couldn’t get a true gauge of how tweets about unemployment rose over time — only whether the topic of unemployment rose within tweets about COVID-19.

Another limitation resulting from our Twitter data was the fact that pinpointing a large set of specific user locations faces many obstacles, particularly if those places are American states. While we could have chosen only to collect tweets featuring the full names of states rather than include their acronyms in our search, this would have greatly reduced the number of total tweets in our dataset, producing inaccurate results. On the other hand, including these acronyms also means that we unintentionally gather tweets with user locations having nothing to do with the state.

Working with data from social media platforms like Twitter is also very time-consuming because the sets are so large. Extracting a single day of tweets can take a full day in real time, and projecting millions of tweets onto a graph from one day to the next was too much for our computers to handle. This is one reason that we opted to measure our results by month rather than by day, despite having access to daily data for both COVID cases and tweets. And because combining two or more datasets that do not use the same parameters forces an even greater reduction in usable data, the number of visualizations we could produce was restricted.

On top of everything, measuring total numbers of tweets or COVID cases per state without comparing them to per capita data is not very insightful. The project would have benefited greatly from a fourth dataset featuring the population of each state over time. However, while we did try to obtain this information toward the end of the project, by that point the scope had expanded far beyond what we anticipated, and tracking down a daily population count for each state while hundreds or even thousands of people were dying of COVID was no easy task either.

CONCLUSION

As a result of these limitations, it was hard to measure exactly what we intended for the project. Comparing multiple datasets to each other did not produce the insights that we expected to obtain. However, we did learn several things from the project despite our setbacks.

One hypothesis we verified was that mentions of COVID-19 on Twitter rose along with actual rates of COVID-19 in the USA — although total COVID cases outpaced the social media conversations about them after the first month. We expected them to grow at similar rates, but that was simply not the case, and it demonstrates just how many people contracted the disease.

Tweets with unemployment-related terms comprised a noticeable share of tweets with any mention of COVID-19, at nearly 5%. So while unemployment might not be the main subject on peoples' minds when thinking about the Coronavirus, it is significant. As we already know, the pandemic had a huge impact on jobs, and some of that impact is reflected on social media.

The largest and most populous states such as New York and California had the highest numbers of COVID cases, the largest numbers of total tweets, and the largest numbers of tweets related to unemployment. This is not surprising due to their large populations — however, as we stated before, these values lose a bit of meaning without precise per capita data.

And while states like Hawaii and Nevada showed extraordinarily high rates of unemployment throughout 2020, that doesn't mean that the disease hit them harder, at least not directly. Our original "tourism and entertainment" hypothesis can't be verified without more research. It may be that, because so much of the population in these economies is transient, the pandemic caused many people to leave for other states, leaving fewer people behind to spread the disease. On the other hand, as we saw in our graphs from the previous years, the unemployment rate in Nevada was already one of the highest in the country, so maybe COVID didn't raise it as much as we thought. Also, upon further research into external sources, it turns out that Hawaii had the lowest total cases of COVID per capita of any state in America, a statistic that seems contradictory given its high rate of unemployment. North Dakota, meanwhile, ranks as the single state hit hardest by COVID in terms of cases per capita, a fact we weren't able to observe without counting its actual population.

This just demonstrates how hard it is to measure the impact of a few factors on the health and prosperity of a whole state. As it turns out, it’s not enough to simply compare a series of numbers on a graph, especially if they’re taken from separate sources of data and represent totally different things. Even if two datasets are related to each other somehow, it takes a great deal of manipulation to make them make sense together.

So, to answer the questions we posed earlier in the article:

  1. Can we use social media data to track the rates of COVID-19 and unemployment as they are happening?
    Yes, although social media is not a total reflection of reality, internet users are not completely reliable sources of information, and it takes a lot of refinement to eliminate noise and isolate only relevant regional data.
  2. If tweets about these issues actually correspond to rates in the real world, could we use this data to predict the rates of unemployment or COVID cases and deaths in the future?
    Yes, but predicting far into the future requires more data.
  3. Can we use this data to help the greatest victims of COVID-caused unemployment?
    Maybe. This project definitely has potential, but as of now, the results are somewhat inconclusive. Again — more data is required!
