Chapter 3 Data transformation

3.1 Transformations for Vaccination Dataset

  1. Converted the Date column into the Date datatype. Removed MMR_week as this was of no use to our problem statement.
  2. Grouped the dataset by Location and Date.
  3. Filtered locations to remove any cumulative indicators such as rows containing ‘US’ and also removed states such as ‘District of Columbia’ which were not present in the inbuilt “state” dataset included in R. Primarily did this to avoid any merge conflicts with other datasets.

3.2 Transformations for Education Dataset

  1. Converted character data into numerical datatype for the value column by type casting.
  2. Filtered rows to not contain any cumulative indicators such as ‘United States’.
  3. Grouped by each state and allowed ‘NA’ values in the value column to be considered as 0 using na.rm = True while summarizing the data after grouping.
  4. Performed a full join operation with population dataset on the state column to further normalize the value column of the education dataset by the population of each state to get a normalized education density of each state.

3.3 Transformations for Employment Dataset

  1. Filtered dataset to not contain any cumulative indicators such as ‘United States’ since we are doing a state-wise analysis.
  2. All character columns containing numerical values were converted to numercial datatype.
  3. Another column named Total was introduced which contained the sum of the rows containing any numerical value. This column indicated the sum of historical employment data.
  4. Some values that did not give any insight in the Description column were filtered out as well. This included “Total employment (number of jobs)”, “Nonfarm employment”, “Private nonfarm employment”, “Wage and salary employment”, “Proprietors employment”, “Nonfarm proprietors employment 2/”. As these values already had more specific sub-categories, keeping these values did not provide any advantage.

3.4 Transformations for Population Dataset

  1. Multiple columns names were read into rows by the read_excel function. These rows were removed and appropriate column name was provided to each column.
  2. Only state and population columns were retained as we only needed this data to find out the density of any given parameter to compare multiple parameters efficiently.

3.5 Transformations for Crime Dataset

  1. The first few rows of the dataset did not have any data in them and so these rows were discarded.
  2. The last few rows of the dataset included description of the dataset so the same had to be removed.
  3. There were some states and column names which had a number in their superscript. We discarded the numbers from such instances and replaced such entries with their actual names.
  4. Majority of the rows did not have any state value associated with them. For instance, if one row had Alabama associated with it, all the rows following it before the name of the next state appeared had the state name missing from it. So we filled these missing values for each row with the appropriate state name.
  5. All other columns such as Aggravated Assault, Burglary etc having missing values were simply replaced with 0.
  6. Each column which had numeric values but was of character type was converted to numeric form.
  7. On grouping the data by each State, we realized that one row still had a missing State name. On closer analysis, we realized that the cities associated belonged to the state Michigan and so we replaced such missing instances with Michigan.
  8. Joined the crime data with the vaccination data on the column name “region” which denoted the states.

3.6 Transformations for Election Dataset

  1. Changed the column names to match with vaccination dataset so as to join them on the common column “region” which denoted the states.
  2. Defined a metric based on the vaccination, population and the election dataset and appended its new column in the dataset.