Chapter 2 Data sources
We used six different datasets in this project. Because we wanted to see how different factors correlate with each other (e.g., vaccination status and education level), we chose multiple datasets so that several factors could be taken into account before drawing a conclusion. All datasets were downloaded from their original sources.
2.1 Vaccination Dataset
Link: CDC Vaccination Dataset
Overall Description: This dataset was downloaded from the Centers for Disease Control and Prevention (CDC) website. It contains vaccination details for each state over a period of time, along with the number of vaccines of each kind distributed and administered, broken down by state and by age group. In total, this dataset has 69 columns and 21528 observations.
Name of Columns: Date, MMWR_week, Location, Distributed, Distributed_Janssen, Distributed_Moderna, Distributed_Pfizer, Distributed_Unk_Manuf, Dist_Per_100K, Distributed_Per_100k_12Plus, Distributed_Per_100k_18Plus, Distributed_Per_100k_65Plus, Administered, Administered_12Plus, Administered_18Plus, Administered_65Plus, Administered_Janssen, Administered_Moderna, Administered_Pfizer, Administered_Unk_Manuf, Admin_Per_100K, Admin_Per_100k_12Plus, Admin_Per_100k_18Plus, Admin_Per_100k_65Plus, Recip_Administered, Administered_Dose1_Recip, Administered_Dose1_Pop_Pct, Administered_Dose1_Recip_12Plus, Administered_Dose1_Recip_12PlusPop_Pct, Administered_Dose1_Recip_18Plus, Administered_Dose1_Recip_18PlusPop_Pct, Administered_Dose1_Recip_65Plus, Administered_Dose1_Recip_65PlusPop_Pct, Series_Complete_Yes, Series_Complete_Pop_Pct, Series_Complete_12Plus, Series_Complete_12PlusPop_Pct, Series_Complete_18Plus, Series_Complete_18PlusPop_Pct, Series_Complete_65Plus, Series_Complete_65PlusPop_Pct, Series_Complete_Janssen, Series_Complete_Moderna, Series_Complete_Pfizer, Series_Complete_Unk_Manuf, Series_Complete_Janssen_12Plus, Series_Complete_Moderna_12Plus, Series_Complete_Pfizer_12Plus, Series_Complete_Unk_Manuf_12Plus, Series_Complete_Janssen_18Plus, Series_Complete_Moderna_18Plus, Series_Complete_Pfizer_18Plus, Series_Complete_Unk_Manuf_18Plus, Series_Complete_Janssen_65Plus, Series_Complete_Moderna_65Plus, Series_Complete_Pfizer_65Plus, Series_Complete_Unk_Manuf_65Plus, Additional_Doses, Additional_Doses_Vax_Pct, Additional_Doses_18Plus, Additional_Doses_18Plus_Vax_Pct, Additional_Doses_50Plus, Additional_Doses_50Plus_Vax_Pct, Additional_Doses_65Plus, Additional_Doses_65Plus_Vax_Pct, Additional_Doses_Moderna, Additional_Doses_Pfizer, Additional_Doses_Janssen, Additional_Doses_Unk_Manuf.
Column Description: All columns are numerical except Date, which holds dates, and Location, which is a categorical variable.
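The column types described above could be enforced when reading the file. The following is a minimal sketch using only the Python standard library, not the code used in this project. The inline sample values are made up, only a handful of the 69 columns are shown, and the `MM/DD/YYYY` date format is an assumption about how the downloaded CSV stores Date.

```python
import csv
import io
from datetime import datetime

# Tiny inline sample standing in for the downloaded CSV (values are made up).
SAMPLE = """Date,MMWR_week,Location,Distributed,Administered
12/01/2021,48,NY,100,80
12/01/2021,48,CA,200,
"""

def parse_row(row, date_format="%m/%d/%Y"):
    """Convert one raw CSV row: Date -> date, Location -> str, the rest -> float.

    date_format is an assumption; adjust it to match the actual file.
    Empty numerical cells become None.
    """
    parsed = {}
    for name, value in row.items():
        if name == "Date":
            parsed[name] = datetime.strptime(value, date_format).date()
        elif name == "Location":
            parsed[name] = value
        else:
            parsed[name] = float(value) if value != "" else None
    return parsed

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
```

Treating every column other than Date and Location as numerical mirrors the column description; a missing value is kept as `None` rather than being silently replaced with zero.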
2.2 Employment Dataset
Link: Statewise GDP - Annual (BEA) Dataset
Overall Description: This dataset was downloaded from the Bureau of Economic Analysis (US Department of Commerce) and contains historical employment statistics for the US from 1998 to 2020. It lets us correlate the overall employment trend of each state with the other datasets. It contains 31 columns and 7084 observations.
Name of Columns: GeoFIPS, GeoName, Region, TableName, LineCode, IndustryClassification, Description, Unit, X1998, X1999, X2000, X2001, X2002, X2003, X2004, X2005, X2006, X2007, X2008, X2009, X2010, X2011, X2012, X2013, X2014, X2015, X2016, X2017, X2018, X2019, X2020.
Column Description: The columns X1998 to X2020, GeoFIPS, and LineCode contain numerical values, while the columns GeoName, TableName, IndustryClassification, and Description contain categorical values.
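Because each year is stored in its own column (X1998 to X2020), comparing employment with the other datasets may be easier after reshaping each row into one record per year. The sketch below illustrates that reshaping under stated assumptions: the sample row and its values are made up, and the output keys `Year` and `Value` are hypothetical names chosen for illustration.

```python
import re

# Matches the year columns X1998 ... X2020 and captures the four-digit year.
YEAR_COLUMN = re.compile(r"^X(\d{4})$")

def wide_to_long(row):
    """Turn one wide row into a list of per-year records."""
    records = []
    for name, value in row.items():
        match = YEAR_COLUMN.match(name)
        if match:
            records.append({
                "GeoName": row["GeoName"],
                "Description": row["Description"],
                "Year": int(match.group(1)),
                "Value": value,
            })
    return records

# Made-up sample row with only two of the year columns.
sample = {
    "GeoFIPS": "01000",
    "GeoName": "Alabama",
    "Description": "Total employment",
    "X1998": 100.0,
    "X1999": 110.0,
}
long_rows = wide_to_long(sample)
```

Each resulting record carries the state name and the year, so it can be matched against state-level rows from the other datasets.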
2.3 Education Dataset
Link: Education (Department of Education) Dataset
Overall Description: Downloaded from the US Department of Education website, this dataset contains counts of the different types of students for each state in which they received their education. It also records the type of program they were enrolled in, along with the sub-categories of each program type. It contains 12 columns and 513721 observations.
Name of Columns: School.Year, State, NCES.LEA.ID, LEA, Data.Group, Data.Description, Value, Population, Category, Sub.Category, Grade.Level, Program.Type.
Column Description: The only numerical column is Value; all other columns are categorical.
2.4 Population Dataset
Link: US Census Dataset
Overall Description: This dataset was downloaded from the US Census Bureau website. It gives the population of each state as of April 1, 2021, in two columns, State and Population (April 1, 2021), with 50 observations.
Name of Columns: State, Population (April 1, 2021).
Column Description: One numerical column, Population (April 1, 2021), and one categorical column, State.
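State populations are useful mainly for putting counts from the other datasets on a comparable per-capita scale. The following sketch shows one way to build a lookup keyed by State and compute a rate per 100,000 residents. The population figures are made up, the presence of thousands separators in the raw values is an assumption, and `per_100k` is a hypothetical helper, not part of the project.

```python
POPULATION_COLUMN = "Population (April 1, 2021)"

def to_int(text):
    """Parse a count that may contain thousands separators, e.g. '1,234' -> 1234."""
    return int(text.replace(",", "").strip())

# Made-up rows standing in for the two-column population file.
population_rows = [
    {"State": "Alabama", POPULATION_COLUMN: "5,000,000"},
    {"State": "Alaska", POPULATION_COLUMN: "700,000"},
]

# Lookup table: state name -> population as an integer.
population_by_state = {
    row["State"]: to_int(row[POPULATION_COLUMN]) for row in population_rows
}

def per_100k(count, state):
    """Express a raw count as a rate per 100,000 residents of the given state."""
    return count * 100_000 / population_by_state[state]
```

A join of this kind only works when both datasets spell state names the same way, so state identifiers may need to be harmonized first.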
2.5 Crime Dataset
Link: Crime Dataset
Overall Description: This dataset was downloaded from the FBI Crime Data Explorer. It contains state-level and city-level counts of the different types of crimes that were reported. This dataset contains 13 columns and 7689 observations.
Name of Columns: State, City, Population, Violent Crime Total, Murder and nonnegligent manslaughter, Rape1, Robbery, Aggravated assault, Property Crime Total, Burglary, Larceny-theft, Motor vehicle theft, Arson2.
Column Description: All columns are numerical except State and City, which are categorical and identify each city and the state it belongs to.
2.6 Election Dataset
Link: Local Elected State Party Dataset
Overall Description: The dataset linked above is restricted and can be accessed only through certain credentials. After going through the description of the dataset, we found the original source (https://www.nga.org/bestpractices/homeland-security/) from which the dataset was curated. We then created the dataset manually by looking up the governor of every state and extracting the information needed for our analysis.
Name of Columns: Office, Name, Party, Date assumed office.
Column Description: The Office column identifies the state, Name gives the name of the governor, Party gives the political party to which the governor belongs, and Date assumed office gives the date on which the governor took office.
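Since this dataset was assembled by hand, a small normalization step can make it easier to match against the other state-level datasets. The sketch below is illustrative only. It assumes that Office values look like "Governor of <state>" and that dates are written like "January 5, 2021"; neither format is confirmed by the text above. The sample row is made up, and the output keys `State` and `AssumedOffice` are hypothetical names.

```python
from datetime import datetime

def parse_governor_row(row, office_prefix="Governor of ", date_format="%B %d, %Y"):
    """Extract the state from Office and parse Date assumed office.

    office_prefix and date_format are assumptions about the manually
    entered values; adjust them to match the actual file.
    """
    office = row["Office"]
    # Strip the prefix when present; otherwise keep the Office value as-is.
    state = office[len(office_prefix):] if office.startswith(office_prefix) else office
    return {
        "State": state,
        "Name": row["Name"],
        "Party": row["Party"],
        "AssumedOffice": datetime.strptime(row["Date assumed office"], date_format).date(),
    }

# Made-up sample row; the name and party are placeholders.
sample = {
    "Office": "Governor of Alabama",
    "Name": "Jane Doe",
    "Party": "Independent",
    "Date assumed office": "January 5, 2021",
}
parsed = parse_governor_row(sample)
```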