An Exploratory Data Analysis (EDA) on the CDC’s COVID-19 Data in the Tri-State Area

Published in

Geek Culture

16 min readFeb 6, 2021

Disclaimer: this EDA studies the total numbers of COVID-19 cases and deaths in each state, not cases and deaths per capita. Its primary purpose is to make observations on general trends in the changes of cases and deaths over time.

The Crisis of Coronavirus

Next to the presidential election, COVID-19 was undoubtedly the most defining event of 2020. The novel Coronavirus is an unprecedented pandemic on a global scale, and its ongoing impact on the world not only sets the stage for the next few years, it may also determine the fate of the entire decade. Many temporary measures being put into place to stop the spread of the virus could become permanent fixtures of society if the disease is not completely eliminated, and some of the regions most impacted by COVID may never return to normal — like large metropolitan areas whose dense populations require wide-reaching social changes in order to prevent contagion.

Introducing the Investigation

As someone who lives in such an urban sprawl — The Tri-State Area of New York, New Jersey and Connecticut — I was curious to see how the Coronavirus developed in this region over time. There were several major questions I wanted answers to through conducting this exploratory analysis:

What was the first state in the area to come down with a case of COVID?
Do states in the same region share similar developments of the disease?
Can I chart significant events by detecting dramatic changes in the data over time?
Are there any unexpected inferences that I make by analyzing this information?

I thought that the best place to start answering these questions would be the website of the Centers for Disease Control and Prevention (CDC). Monitoring the Coronavirus has become a primary objective for this federal health agency, and they had the perfect dataset for my analysis.

https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36 [1]

I wanted to take a first look at the set before I downloaded it into to Jupyter Notebooks, so I clicked on “View Data” to see a spreadsheet detailing the data.

https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36/data [2]

I already notice some issues with the data that I’ll need to address before graphing it — the date is out of order, several columns have empty rows, other columns have data that is non-numerical, and some numbers are negative. I also see the number of rows listed below: a staggering 22,740! This is an enormous dataset that most likely requires quite a lot of cleaning.

Downloading the Data

Before I can import the dataset into Jupyter Notebooks through the CDC’s JSON API, I want to import a few external Python libraries first — Pandas for data analysis, and Matplotlib for data visualization.

import pandas as pdimport matplotlib.pyplot as plt

Another important note to mention before importing the data is that there is an automatic limit to the number of fields that are read from the JSON file. Pandas only reads 1,000 rows by default, but there are nearly 23,000 rows in this dataset — so let’s set it to a comfortable limit of 25,000 (to be safe).

df = pd.read_json("https://data.cdc.gov/resource/9mfq-cb36.json?$limit=25000")

Inspecting the Information

Now that we’ve imported our dataframe, let’s use some functions to inspect it.

The info() function obtains important information about the dataframe such as the range of rows it has, a count of all its columns, the columns’ data types, and the number of non-null values contained within each.

While some of the columns contain nothing but non-null values (all 22,740 individual entries inside that column are filled with data), many of the columns have far fewer non-nulls (many of the rows are blank). Because null values will disrupt our data analysis, we need to take care of these soon.

Another useful function is describe(), which calculates a series of statistics about the dataframe.

Cleaning the Columns

Let’s get a clearer view of our dataframe so we can check out the columns and start cleaning them up.

The first obstacle I see is that “conf_cases,” “prob_cases,” “pnew_case,” “conf_death,” “prob_death,” “pnew_death,” “consent_cases,” and “consent_deaths” all contain significant amounts of nulls, identified by “NaN” (Not a Number).

What does any of this mean? According to the CDC’s website, most of these columns signify “confirmed” and “probable” counts of cases and deaths. Apparently, while many suspected cases of COVID-19 are reported because they fit the criteria for the disease, not every instance is officially affirmed as COVID-caused, and some probable case reports are withdrawn entirely…

“…if further public health investigation determines that the individual most likely did not have COVID-19.”

This occasional retraction of COVID cases also accounts for the negative numbers mentioned earlier. And that is not the only factor that complicates the monumental task of reliable COVID-19 counting. By its own admission, the CDC is unable to produce COVID data that is completely accurate, for a variety of reasons cited on its website, one of them being that they only aggregate data reported by the states and territories themselves. The accuracy of the data is dependent on the jurisdictions’ own reporting. However, this should not negate the value of the data. While the exact number of cases and deaths may be hard to discern, the numbers are valuable as close estimations.

That being said, the “probable” and “confirmed” columns in this dataset should be removed, because data with inconsistent values is going to get in the way of a proper analysis. The columns “consent_cases” and “consent_deaths” can be removed for the same reason, and while the “created_at” column seems to indicate when each case was actually admitted into the dataset, the date of creation always lags behind the original date of submission by at least a day, sometimes longer. I’m only interested in charting the initial reports because those dates more closely reflect when people contracted the coronavirus. Having two dates in the dataset is also redundant.

to_drop = ['conf_cases','prob_cases','pnew_case','conf_death',
'prob_death','pnew_death','created_at', 'consent_cases','consent_deaths',]df.drop(to_drop, inplace=True, axis=1)

Just a few fields remain — “submission_date,” “state,” “tot_cases,” “new_case,” “tot_death,” and “new_death.” Now we’re one step closer to a clean dataset — but the column “submission_date” still presents us with some problems. Extraneous numbers, a lack of chronological order, and an unaesthetic name are all issues we have to address. While we’re at it, let’s also change the names of the other columns to make them more presentable.

df = df.rename(columns={'submission_date': 'Date'})df['Date']=pd.to_datetime(df['Date'])df = df.sort_values(by=['Date'])df = df.rename(columns={'state':'State'})df = df.rename(columns={'tot_death':'Total_Deaths', 'new_death':'New_Deaths'})df = df.rename(columns={'tot_cases':'Total_Cases', 'new_case':'New_Cases'})

At last, a lovely set of clean, neatly named and properly ordered data. The last thing left to do is to filter out the states that aren’t in the Tri-State Area. Let’s make a new dataframe exclusively for New York, New Jersey and Connecticut.

dfTRI = df[(df[“State"].str.contains('NY|NJ|CT', na=False))]

One interesting detail to point out is that New York City is listed as its own state, and New York State’s reports do not include data on the city. According to the CDC, there are 60 public health jurisdictions currently reporting cases of COVID-19, including the 50 states as well as major cities such as Washington DC and NYC, and territories located outside of the contiguous USA. Because New York City is listed separately, this EDA features its data as its own region, alongside Connecticut, New Jersey and New York State.

Now that our dataset is organized, we can disable the maximum set of rows displayed and view the entire dataframe in a single window.

pd.set_option('display.max_rows', None)

The next thing we notice is that there isn’t a single case of COVID for several days — in fact, the first case of COVID-19 in the Tri-State Area isn’t listed until March 2nd, which is nearly three months after the dataset begins its records. We don’t need an entire season of zeroes here, so we’re only going to show rows with a total number of cases greater than zero.

dfTRI = dfTRI[dfTRI.Total_Cases > 0]

Pandemic Plots

Now we can easily identify the first reported cases of COVID-19 in the region:

New York City: March 2nd
New York State: March 4th
New Jersey: March 5th
Connecticut: March 8th

Let’s create another dataframe that excludes everything but state names containing “NY”, and use Matplotlib to map out our first plot — a comparison of total COVID cases in New York City and New York State.

dfNY = dfTRI[(dfTRI['State'].str.contains('NY', na=False))]
ax = plt.gca()dfNY.groupby('State').plot(kind='line', x='Date', y='Total_Cases', ax=ax)plt.title("Total Covid Cases in New York City VS New York State")
ax.legend(['NY','NYC']

Despite their division into two separate datasets, both New York’s show fairly similar trends in the growth of total COVID cases over time.

The populations of these regions are not identical, but they are in the same range — with about 8.33 million people in 2019 [3], New York City comprised nearly half the total population of New York State, which was inhabited by around 19.5 million people in the same year [4], 11 roughly million people without the city. With both a lower population and a higher total case count, it would appear that New York City fared worse in total COVID-19 case numbers than New York State in the earliest months of the pandemic. However, the two New York’s begin to meet in the middle towards the end of November, and the total number of cases in New York State outpaces that of its urban counterpart the more that time goes on. Unfortunately, without a precise measure of population in 2020, it’s hard to say whether the total case count per capita of New York actually rose above New York City’s in the past few months. What is apparent, though, is that something caused the case count trends to reverse around the start of the holiday season.

The seasonal change from Autumn to Winter was probably a major factor in rising case numbers for both locations, as people neglected social distancing in order to spend the holidays with friends and family members that they might have avoided throughout the first half of the year. Another factor in rising case numbers might be the colder weather of the winter months forcing more activities from outside to inside, where the lack of distancing space and air circulation lead to a better environment for viral transmission.

But what lead to the acceleration of total cases in the state of New York over the city? Are there any reasonable conjectures we can make? My first thought was that maybe the city maintained stricter local laws for curfews, business closures and gatherings of people, which resulted in the prevention of more total cases than in the rest of the state. Maybe people in more rural regions of the state are also less inclined to follow government mandates than people in more densely populated metropolitan areas. It bears repeating, though, that while my initial thoughts might sound like reasonable ideas in casual conversation, both of these claims end up being baseless assumptions without citing actual sources on cases per capita.

Moving on, let’s analyze the total number of cases in the entire Tri-State Area.

This plot shows that Connecticut has far fewer cases than New York and New Jersey, which share much more similar trends in case numbers. Are commuters between these two different regions responsible for transmission across state lines? Or is population simply a greater factor? While New Jersey’s population of 8.88 million [5] is comparable to that of New York State and New York City, Connecticut is much lower at 3.5 million. [6] Given this information, having lower numbers of total COVID-19 cases makes sense.

Numbers in New Jersey

Now that we’ve briefly compared the total cases in each region side-by-side, let’s analyze a single state (New Jersey) in a bit more detail, by creating a set of more individualized dataframes, and charting each column in the table.

dfNYS = dfNY[~dfNY.State.str.contains('NYC')]
dfNYC = dfNY[(dfNY['State'].str.contains('NYC', na=False))]
dfNJ = dfTRI[(dfTRI['State'].str.contains('NJ', na=False))]
dfCT = dfTRI[(dfTRI['State'].str.contains('CT', na=False))ax = plt.gca()
dfNJ.plot(kind='line', x='Date', y='Total_Cases', color='blue', ax=ax)
dfNJ.plot(kind='line', x='Date', y='New_Cases', color='green', ax=ax)
dfNJ.plot(kind='line', x='Date', y='Total_Deaths', color='black', ax=ax)
dfNJ.plot(kind='line', x='Date', y='New_Deaths', color='red', ax=ax)
plt.title('Covid Stats in New Jersey')

The most interesting aspect of this plot is how much the number of total cases absolutely dwarfs every other column, including total deaths. This certainly seems to support the claim of Coronavirus’s fairly high probability of survival relative to its rate of infection — although this statement isn’t meant to downplay the number of deaths that COVID-19 has caused, which is a tremendous tragedy whatever the mortality. Other regions show the same trends in total case numbers compared to other measurements, and the large difference between those numbers and new cases/total deaths/new deaths makes the columns difficult to analyze.

Comparing the Cases

Let’s create another series of plots by excluding total cases from each region. These graphics are intriguing because they show that, while the general shape of each line is somewhat similar from state to state, the devil is in the details — the ratio of new cases/total deaths/new deaths in each state is unique.

Every region experienced a spike in new cases that reaches a peak in April, leading to a steady upwards curve in total deaths when the pandemic first started spreading throughout the country. For the most part, this number began evening out for the next few months, and the growth in news cases and deaths stops almost completely from July to October. I think it’s safe to say that the slowdown was probably due to a combination of quarantines, social distancing measures, and an increase in activities that could be safely done outside. However, starting in November, new COVID cases and deaths began to rise again at an exponential rate, reaching another zenith in January of this year, and leading to another escalation in total deaths that continues to rise.

The similarities end there, though— these plots feature major differences that are difficult to explain. For example, the exact ratio of new case numbers to total deaths varies wildly in every graphic. One thing that sets Connecticut’s plot apart from the others is a dramatic swell in new cases that surpasses the total number of deaths it experiences in the wintertime, whereas the number of new cases in New York State during the same season lies just below total deaths. New York City and New Jersey have the lowest ratio of new cases to new deaths — except for when New Jersey experiences an unexpected surge in new cases in early 2021, with a sharp retraction that drops down to nearly -10,000. What could be the cause of such fluctuations?

Another particular point in the plots that causes me to pause is a spike and retraction in new COVID cases in New York State, this time in early April. Either an enormous number of people received positive test results that were immediately taken back days later, or there are some major errors in my data.

A third thought — maybe the COVID-monitoring computer systems in these states were so overloaded by new case numbers that there were inevitable errors unrelated to the validity of virus test results, and the negative numbers were released to resolve the errors after they were detected.

Let’s isolate new COVID cases from the other statistics and compare the states.

One last questionable series of spikes and retractions is present in New York City’s new cases of COVID-19 from mid-May to mid-July. This erratic pattern is particularly confusing because it’s not related to any expected seasonal spikes or retractions in new cases, unlike the ones that are present in New York State and New Jersey. The only explanation I can tentatively propose is that, as more people left their houses in the warmer weather, more people became concerned that they were exposed to Coronavirus outside, and more false cases were reported. Yet I also can’t make this claim with certainty. If this development was the underlying cause, then why was it only in NYC? My analysis is inconclusive without an explanation from an outside source.

Death Data

Our final analysis of the dataset focuses on total and new deaths in the area.

The most striking feature in both of these graphs is the high number of both total and new deaths in New York City compared to the states around it, even while the other regions have comparable populations. The city has a total death count of over 25,000 and experienced an enormous spike in new deaths during April, with a maximum count of over 4,000 new deaths in a single day.

New York state also suffered a heavy loss of life in early April, although the number of new deaths only reaches around 2,000 — and there is just as great of a retraction in new deaths shortly afterwards. How can we explain such a sharp reversal of new death numbers? Could it indicate misunderstandings about COVID symptoms early on in the pandemic’s development? Maybe people overestimated its rate of mortality. Because COVID-19 shares so many symptoms with other diseases, it was probably easy to mistake unrelated deaths as COVID-caused. However, this sharp retraction in death numbers could also be the result of an overloading the system, as I mentioned earlier.

Both New York City and the state of New Jersey show another sharp spike in new deaths at the very end of June, with New Jersey’s spike in new deaths nearly triple the size of the new deaths in NYC. Could this sudden swell of new deaths be related to the end of school and the start of summer vacation for students and teachers? Also noteworthy is a minor spike and retraction right before October, one that I have a harder time trying to explain.

In Connecticut, the numbers of new deaths are so low that it’s hard to see on a graph compared to the other regions of the Tri-State Area. We should plot out one more pair of datasets to take a closer look at Connecticut’s new deaths (the lowest in the area) and compare that to the new deaths in New York City (the highest). The actual death numbers themselves are going to be very different between the two regions, but we might be able to obtain some insights by comparing the trends in new deaths over time.

While the number of new deaths in New York City is extremely high in April, new deaths later in the year are completely absent. On the other hand, Connecticut’s new deaths are low in general — but the state experiences a large second wave in January that is nearly as big as its first one.

What could be the cause? It may be that, because death by COVID-19 seems to be less of a risk in Connecticut, the people there took less precautions to prevent its spread, and when the second wave arrived at the beginning of 2021, the few people who did end up dying from it were unprepared. Meanwhile, I imagine New Yorkers taking Coronavirus very seriously after suffering so many losses in April, and maybe they did a better job at preventing a second wave of deaths because they already knew the risks.

Again, though, this is only my own personal speculation — I can’t know for sure what the reasons are without external confirmation.

Coronavirus Conclusions

This EDA of COVID-19 in the Tri-State Area has taught me a great deal about how to extract information from large datasets, in addition to the long-term developments of the Coronavirus itself. I gained several insights about the region I live in, as well. However, I also conclude this project with even more questions — like what further revelations might be gained from this analysis with more context, such as news stories or per capita data? Or, how can we use this information to prevent the spread of more diseases like COVID?

One major limitation I had in analyzing this data was my lack of familiarity with Pandas and Matplotlib. If I had more experience using these libraries then I would be able to create more plots to display my data in interesting ways, and make new calculations on existing data to reveal novel insights.

Whatever my limitations were prior to this project, though, I know more than I did when I started. This was a great introduction to the world of web mining.