Cleaning Messy Data


Raw data doesn’t exist

A dataset, at best, represents a process under study. However, information is always lost and biases are often introduced between the occurrence of the process and recording the observations in a dataset. A person must decide what relevant measurements to make and implement the instrumentation. These human decisions create a gap between the process under study and the observations being used to study it.

The following terms, previously discussed in relation to the scientific method, describe the core objects of study in data science:

  • A data generating process (DGP) is the underlying, real-world mechanism that generates the observed data.

  • Observed data (collected into a dataset) are incomplete artifacts of the data generating process.

  • A (probability) model is a simplified explanation of the data generating process. The process described using a model is compared to the observed data to judge the quality of the explanation.


To make life more difficult, data scientists often find themselves working with already existing datasets created for purposes different than their current interest. In these cases, special care must be taken to understand both what process the dataset represents and how well it represents it. As such, the data scientist should be an ever-skeptical detective attempting to describe the story of how the dataset came to be, including:

  1. a rough description of the assumptions on the underlying data generating process,

  2. a detailed description of how the DGP came to populate each field in the dataset (the measurements),

  3. a detailed description of any data processing and/or storage decisions that affected the values in the dataset.

These descriptions together form the data’s provenance.

“Messy data” are deviations from the process being modeled that are not due to randomness. Armed with the provenance for a dataset, the data scientist “cleans” the messy data to best reflect the data generating process.

Data Cleaning

Definition: Data cleaning is the process of transforming data into a faithful representation of an underlying data generating process to facilitate subsequent analysis.

A common procedure for data cleaning often involves:

  1. Fixing data types so columns reflect the kind of data in each field,

  2. Resolving ambiguities and inconsistencies in the data,

  3. Identifying corrupt or unlikely values for removal, fixing, or declaration as missing.

Each of these steps requires applying an arsenal of new Pandas methods to the columns of an existing dataset, while storing the result in a cleaned dataset.

Data types

Datasets are often designed for human consumption, rather than for computation. Thus, a data scientist attempting to simply describe the contents of a dataset must first fix data-type errors. Whenever possible, one should leverage existing dataset documentation (such as a data dictionary or a database schema) to set the data-types. In the absence of such resources, one must infer the correct types by reasoning about the process that generated the data.

  • Cleaning the data types of a dataset involves reconciling the data type stored in each column with the kind of data the column represents.

Such cleaning is illustrated in the example below.

Example: The dataset below contains one observation for each campus of the University of California. These observations derive from the College Scorecard dataset. Each observation consists of the following attributes:

  1. COLLEGE_ID: unique ID for each college

  2. NAME: name of the institution

  3. UNDERGRAD_POP: the size of each university’s undergraduate population

  4. PCT_FED_LOAN: the percentage of students at the university who took out federal student loans

  5. MEDIAN_GRAD_DEBT: the median debt that students hold upon graduation

  6. SETTING: the type of environment in which the university is located

uc = pd.read_csv('data/ucschools.csv')
uc
COLLEGE_ID NAME UNDERGRAD_POP PCT_FED_LOAN MEDIAN_GRAD_DEBT SETTING
0 131200.0 University of California-Berkeley 30574.0 23.73% $13200 City
1 131300.0 University of California-Davis 30046.0 36.0% $14000 City
2 131400.0 University of California-Irvine 29295.0 36.4% $16500 Suburban
... ... ... ... ... ... ...
6 132000.0 University of California-Santa Barbara 22181.0 36.32% $16300 Suburban
7 132100.0 University of California-Santa Cruz 17577.0 46.41% $19700 Suburban
8 4127100.0 University of California-Merced 7375.0 50.13% $19018 Rural

9 rows × 6 columns

Any attempts to perform an exploratory data analysis on the columns of the uc dataset would quickly produce type-errors (verify this yourself!). To assess each field of the uc dataset, Pandas provides a DataFrame property .dtypes that returns a series with the data type of each column.

The table uc contains columns of float and object data-types. Recall that object columns contain mixed data-types and are usually indicative of the presence of string values. Determine which data-types in the table match the data-types implied by the kind of attribute each column contains.

uc.dtypes
COLLEGE_ID          float64
NAME                 object
UNDERGRAD_POP       float64
PCT_FED_LOAN         object
MEDIAN_GRAD_DEBT     object
SETTING              object
dtype: object

COLLEGE_ID Column: This column is a unique identifier for each university. As there doesn’t seem to be a meaningful order to the column, it is likely nominal. Since floating-point numbers are subject to rounding errors, representing COLLEGE_ID as a float might change the unique identifier for an observation. Thus, such a column should be represented as either an integer-type or string-type. While either choice is faithful to the meaning of the field, storing the values as int-types is more space-efficient.
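
To check the space-efficiency claim, one can compare the memory footprint of the two representations; a quick sketch (the exact byte counts depend on the platform and Pandas version):

as_int = uc['COLLEGE_ID'].astype(int)               # 8 bytes per value
as_str = uc['COLLEGE_ID'].astype(int).astype(str)   # Python string objects
as_int.memory_usage(deep=True), as_str.memory_usage(deep=True)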

Remark: Why are the values float-type in the first place? Data processing programs often infer data-types incorrectly: if a column is purely numeric, the program assumes the column is quantitative in nature. In Pandas, for instance, a numeric column that contains missing values is stored as a float, since the default integer type has no representation for a missing value.

Transform the type of the column using the astype Series method:

uc['COLLEGE_ID'].astype(int)
0     131200
1     131300
2     131400
      ...   
6     132000
7     132100
8    4127100
Name: COLLEGE_ID, Length: 9, dtype: int64

PCT_FED_LOAN Column: This column is quantitative, yet Pandas stores it with the object type. One can confirm these values are strings by inspecting one of the entries:

ucb_loans = uc.loc[0, 'PCT_FED_LOAN']
ucb_loans, type(ucb_loans)
('23.73%', str)

Because of the presence of the % symbol, Pandas doesn’t immediately know how to coerce the values of PCT_FED_LOAN to float-type (verify this!). The column requires cleaning first: strip off the %, then coerce the remaining number to a float-type. To strip off the % characters, use the string methods Pandas provides within its str name-space. The cleaning steps are as follows:

  1. strip the % character off the end of each string using Series.str.strip,

  2. coerce the remaining string to a float using astype,

  3. convert the percentage (between 0-100) to a proportion (between 0-1).

(
    uc['PCT_FED_LOAN']
    .str.strip('%')
    .astype(float)
    / 100
)
0    0.2373
1    0.3600
2    0.3640
      ...  
6    0.3632
7    0.4641
8    0.5013
Name: PCT_FED_LOAN, Length: 9, dtype: float64

MEDIAN_GRAD_DEBT Column: Similarly, the median debt at graduation is a quantitative column being represented as a string-type because of the presence of the $ character. A similar process translates these formatted strings into float-types:

(
    uc['MEDIAN_GRAD_DEBT']
    .str.strip('$')
    .astype(float)
)
0    13200.0
1    14000.0
2    16500.0
      ...   
6    16300.0
7    19700.0
8    19018.0
Name: MEDIAN_GRAD_DEBT, Length: 9, dtype: float64

SETTING Column: This column is ordinal, as there is an ordering of a university’s setting by population density. The ordering of the values is:

Rural < Suburban < City < Urban

There are three options for cleaning an ordinal column, each with advantages and disadvantages. The ordinal column can be left as is, using the given string representation, mapped to integer values reflecting the ordering described above, or encoded using a Pandas Categorical data-type.

For example, one can create an explicit integer-encoding by passing a dictionary to the replace method:

encoding = {
    'Rural': 0,
    'Suburban': 1,
    'City': 2,
    'Urban': 3
}

uc['SETTING'].replace(encoding)
0    2
1    2
2    1
    ..
6    1
7    1
8    0
Name: SETTING, Length: 9, dtype: int64

Alternatively, Pandas has a categorical data-type that does this conversion automatically. However, a categorical data-type representing an ordinal column must have the order explicitly defined:

values = ['Rural', 'Suburban', 'City', 'Urban'] # with order!
setting_dtype = pd.CategoricalDtype(categories=values, ordered=True)
setting_dtype
CategoricalDtype(categories=['Rural', 'Suburban', 'City', 'Urban'], ordered=True)
uc['SETTING'].astype(setting_dtype)
0        City
1        City
2    Suburban
       ...   
6    Suburban
7    Suburban
8       Rural
Name: SETTING, Length: 9, dtype: category
Categories (4, object): [Rural < Suburban < City < Urban]

UNDERGRAD_POP Column: This column is quantitative, as arithmetic on this column makes sense. For example, the average or total undergraduate populations across schools represent meaningful quantities. As such, even though the column represents integer counts (students don’t come in fractions!), either a float-type or an integer-type is an appropriate representation.
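
For instance, summary arithmetic on the column is meaningful regardless of which numeric type is used (a quick illustration using the uc table):

uc['UNDERGRAD_POP'].mean(), uc['UNDERGRAD_POP'].sum()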

Finally, one can combine the cleaning steps above into a new table:

cleaned_uc = (
    pd.DataFrame().assign(
        COLLEGE_ID=uc['COLLEGE_ID'].astype(int),
        NAME=uc['NAME'],
        UNDERGRAD_POP=uc['UNDERGRAD_POP'],
        PCT_FED_LOAN=(uc['PCT_FED_LOAN'].str.strip('%').astype(float) / 100),
        MEDIAN_GRAD_DEBT=uc['MEDIAN_GRAD_DEBT'].str.strip('$').astype(float),
        SETTING=uc['SETTING'].astype(setting_dtype)
    )
)
cleaned_uc
COLLEGE_ID NAME UNDERGRAD_POP PCT_FED_LOAN MEDIAN_GRAD_DEBT SETTING
0 131200 University of California-Berkeley 30574.0 0.2373 13200.0 City
1 131300 University of California-Davis 30046.0 0.3600 14000.0 City
2 131400 University of California-Irvine 29295.0 0.3640 16500.0 Suburban
... ... ... ... ... ... ...
6 132000 University of California-Santa Barbara 22181.0 0.3632 16300.0 Suburban
7 132100 University of California-Santa Cruz 17577.0 0.4641 19700.0 Suburban
8 4127100 University of California-Merced 7375.0 0.5013 19018.0 Rural

9 rows × 6 columns

cleaned_uc.dtypes
COLLEGE_ID             int64
NAME                  object
UNDERGRAD_POP        float64
PCT_FED_LOAN         float64
MEDIAN_GRAD_DEBT     float64
SETTING             category
dtype: object

Example: Sometimes a column contains values that do not easily coerce to the intended type. For example, suppose the Series given below needs to be represented as a column of float-type:

numbers = pd.Series('1,2,2.5,3 1/4,6 7/8,8.3,9,11,99/3'.split(','))
numbers
0       1
1       2
2     2.5
     ... 
6       9
7      11
8    99/3
Length: 9, dtype: object

Attempting to use astype(float) results in a ValueError, as Python doesn’t know how to convert ‘3 1/4’ to a float. To convert this column to float-type, one must write a custom function that handles fractional and mixed-number values.

def convert_mixed_to_float(s):
    '''converts a string representation of an integer,
    decimal, fraction, or mixed number to a float'''
    if '/' not in s:
        return float(s)
    else:
        if ' ' in s:
            whole, frac = s.split(' ')
        else:
            whole, frac = '0', s

        num, denom = frac.split('/')
        return float(whole) + float(num) / float(denom)
numbers.apply(convert_mixed_to_float)
0     1.0
1     2.0
2     2.5
     ... 
6     9.0
7    11.0
8    33.0
Length: 9, dtype: float64

Remark: The Pandas function pd.to_numeric(series, errors='coerce') blindly coerces the values of a Series to numeric values. The keyword argument errors='coerce' silently replaces any non-coercible value with a missing value. This often leads to dropping data that never should have been dropped; non-coercible data should be handled with care, as it’s often present due to systematic issues with how the data were recorded. If such a function were used on numbers, one third of the data would be lost!
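
To see the danger concretely, apply pd.to_numeric to the numbers Series from above and count how many values would silently be turned into missing values (a quick check, not a recommended cleaning step):

coerced = pd.to_numeric(numbers, errors='coerce')
coerced.isna().sum()   # 3 of the 9 values -- all of the fractions -- are silently lost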

Summary: Cleaning up data-types

Quantitative columns should be represented with numeric data types (float and integer types). However, float-types should only represent quantitative columns: float values are subject to precision errors, so a stored value is only guaranteed to be close to the original value, and “close” is only meaningful for quantitative data.
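
For intuition on why exact representation matters for identifiers, note that a 64-bit float cannot represent every large integer exactly, so two distinct identifiers can collapse to the same stored value (a quick illustration, not specific to this dataset):

# 2**53 + 1 is the smallest positive integer that float64 cannot represent exactly
float(2**53) == float(2**53 + 1)   # True -- two different identifiers become indistinguishable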

Ordinal data should be represented with a data type that supports ordering (e.g. string or integer types) and stores values exactly (so the label doesn’t change). Three common approaches are outlined below:

  1. left as is, using the given string representation.

    • Advantages: easy to interpret the values (they remain unchanged)

    • Disadvantages: the string ordering may not match the ordering of the column; strings are inefficient to store.

  2. mapped to integer values, reflecting the ordering of the values of the column (called integer coding).

    • Advantages: the ordering of the ordinal values are captured by the integers; integers are efficient to store.

    • Disadvantages: integers obscure what the value originally represented.

  3. encoded using a Pandas Categorical data-type.

    • Advantages: stores and orders values as integers, while displaying them as strings.

    • Disadvantages: relatively new feature in Pandas that may not always be available to use.

Nominal data should be represented by a data type that stores values exactly. As the values are merely labels, any change in the values would be an arbitrary change to the contents of the column.

Missing data

While Pandas encodes missing values using the value NaN, missing data may appear in a dataset in a variety of ways. Different data processing and storage systems may use different conventions (e.g. NULL, None, "") or may disallow missing values entirely. This section focuses on strategies for identifying the missing values in a dataset; later, the course confronts the problem of how to handle the identified missing values.

Common examples of ‘placeholder’ missing values that are not obviously missing at first glance:

  1. The integer 0 is a common stand-in for missing values, especially when the value is unlikely to occur in a dataset. For example, an age of 0 in a dataset of adults is more likely a placeholder for a missing value than a true age.

  2. The integer -1 is a common stand-in for missing values for columns containing non-negative values. For example, a column containing the “number of children in a family” might use -1 as a missing value.

  3. 1900 and 1970 are common placeholders for missing dates in Excel and UNIX, respectively. Each of these conventions picks a time as ‘time zero’ and stores dates and times as “time elapsed since time zero” (1970 is roughly the year UNIX was created). In this way, these placeholders are both special cases of using 0 as a missing value.
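
Once such placeholder conventions are identified, the placeholders can be explicitly converted to NaN, for example with replace. A minimal sketch, using a hypothetical ages Series in which 0 and -1 both act as stand-ins for missing values:

import numpy as np

ages = pd.Series([34, 0, 27, -1, 45])     # hypothetical data: 0 and -1 are placeholders
ages.replace({0: np.NaN, -1: np.NaN})     # placeholders become proper missing values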

Example: The fitness tracking app Strava encodes missing location coordinates as 0,0. This encoding gives the impression of Strava user activity at 0°00’00.0”N+0°00’00.0”E, dubbed Null Island.

  • While this is a real location, its position in the middle of the Atlantic Ocean makes it highly unlikely to be a meaningful value.

  • The common usage of 0 as a missing value makes it likely that this activity comes from users with their location tracking disabled.

Example: The bike sharing company Bluebikes makes bike usage data available for the city of Boston. Plotting the distribution of Bluebikes users’ year of birth reveals a large spike of riders born in the year 1969.

[Figure: distribution of Bluebikes riders’ reported year of birth]

Is this spike unreasonably large? Or is it possible riders of exactly that age use the bike share service that much more?

Most likely, many of these values are missing values: a UNIX timestamp of -1 falls one second before ‘time zero’, in the year 1969. More problematic is that there really are riders whose birth year is 1969! In this case, telling apart the missing values from the true values may be difficult.
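
One can check this epoch arithmetic directly in Pandas; pd.to_datetime interprets integers as offsets from the UNIX epoch when given unit='s':

# timestamp 0 is 'time zero'; timestamp -1 is one second earlier, in 1969
pd.to_datetime(0, unit='s'), pd.to_datetime(-1, unit='s')
# (Timestamp('1970-01-01 00:00:00'), Timestamp('1969-12-31 23:59:59'))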

Example: The college scorecard data contains information on many colleges beyond the University of California campuses. The table colleges below contains a much larger sample of universities across the United States.

colleges = pd.read_csv('data/colleges.csv')
colleges.sample(5)
COLLEGE_ID NAME UNDERGRAD_POP PCT_FED_LOAN MEDIAN_GRAD_DEBT SETTING
4261 1005200.0 Jefferson Lewis BOCES-Practical Nursing Program 77.0 68.18% $7250 Rural
1604 2099700.0 Ross Medical Education Center-Flint 153.0 62.5% $9500 Suburb
830 1113300.0 College of Eastern Idaho 698.0 34.04% $9500 Urban
6031 4241500.0 Wave Leadership College 52.0 53.23% PrivacySuppressed Urban
4814 3855300.0 Ecclesia College 229.0 51.29% $16500 Rural

The previous code written for cleaning MEDIAN_GRAD_DEBT no longer works on this larger dataset (verify that it throws an exception), as some values don’t conform to the standard representation of US dollars. Before generalizing the cleaning code to the larger dataset, it’s necessary to understand the extent to which these “unusual values” are present.

def check_float(x):
    '''returns true if a value is coercible to a float;
    otherwise returns false'''
    try:
        float(x)
        return True
    except ValueError:
        return False

Compute the distribution of values that aren’t of the form $ followed by a string coercible to a float value:

coercible = (
    colleges['MEDIAN_GRAD_DEBT']
    .str.strip('$')
    .apply(check_float)
)

colleges.loc[~coercible, 'MEDIAN_GRAD_DEBT'].value_counts()
PrivacySuppressed    1175
Name: MEDIAN_GRAD_DEBT, dtype: int64

In the larger dataset, the only value that doesn’t represent dollar amounts is ‘PrivacySuppressed’. The meaning of this value is explained in the dataset documentation:

any data not reported in order to protect an individual’s privacy are shown as PrivacySuppressed.

Thus, it seems reasonable to interpret this value as a missing value and replace it with NaN before coercing the remaining dollar amounts to floats. Applying this cleaner, along with the other per-column cleaners collected into the dictionary csc_cleaning_map in the Code Design section below, yields a cleaned table:

def median_grad_debt_cleaner(ser):
    '''returns MEDIAN_GRAD_DEBT as a float (in USD), treating
    PrivacySuppressed as a missing value'''
    return ser.replace('PrivacySuppressed', np.NaN).str.strip('$').astype(float)

colleges_cleaned = colleges.agg(csc_cleaning_map)
colleges_cleaned
COLLEGE_ID NAME UNDERGRAD_POP PCT_FED_LOAN MEDIAN_GRAD_DEBT SETTING
0 100200 Alabama A & M University 4824.0 0.7697 32750.0 Urban
1 105200 University of Alabama at Birmingham 12866.0 0.5207 21833.0 Urban
2 2503400 Amridge University 322.0 0.8741 22890.0 Urban
... ... ... ... ... ... ...
6175 4106301 Palm Beach Academy of Health & Beauty-Distinct... 5.0 0.0714 6333.0 NaN
6176 295600 Piedmont International University 336.0 0.4847 12498.0 Urban
6177 4250501 National Personal Training Institute-Tampa 32.0 0.2982 NaN NaN

6178 rows × 6 columns

Unfaithful data

The provenance of a dataset, from a real-world event to a dataset displayed in a notebook, is often long and complicated. However, for this data to be of any use, one must assess how well it captures the “reality” it’s meant to describe. Once a dataset is properly typed, it should be assessed for its faithfulness to the data generating process.

Generally, such an assessment involves asking whether the data contain unrealistic or “incorrect” values. For example:

  • Are there dates in the future for events that occurred in the past?

  • Are there locations in the dataset that don’t exist?

  • Are there negative counts? (is that a missing value?)

  • Are names misspelled? Do single names have variants such as nicknames?

  • Are there unreasonably large outliers?
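
Many of these questions can be checked programmatically once the types are cleaned. A minimal sketch of two such checks on the colleges_cleaned table (which checks make sense depends entirely on the dataset):

(
    (colleges_cleaned['UNDERGRAD_POP'] < 0).sum(),        # number of negative counts
    colleges_cleaned['PCT_FED_LOAN'].between(0, 1).all()  # are all proportions within [0, 1]?
)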

Assessing for faithfulness involves being skeptical of the known data provenance and doing research into the assumptions about how the data were generated. This assessment generally involves understanding and identifying problems with the data, then attempting to “fix” them. How to fix unfaithful data depends on the context:

  • Poor quality observations may be dropped if there are very few of them or if they occur sufficiently at random.

  • The likely values might be (approximately) inferred, either from research into the data provenance or from other values in the dataset.

  • The source of the problem might be fixed through changes in the experimental design, instrumentation, or data collection.

Assessing the faithfulness of the data to the data generating process involves interpreting exploratory data analyses of the attributes.

Example: The ‘UNDERGRAD_POP’ attribute represents the count of undergraduates at colleges in the dataset. An initial assessment of this attribute involves plotting the distribution of undergraduates and analyzing the values and their proportions:

colleges_cleaned['UNDERGRAD_POP'].plot(
    kind='hist', bins=25, logy=True, 
    title='distribution of college population (undergraduate)'
);
[Figure: log-scaled histogram of college undergraduate populations]

One interesting observation is how many small colleges are in the dataset (note the log-scale!). How small are they? Are the values ever negative?

Filtering for colleges with fewer than one student reveals six colleges with an undergraduate population of 0 and no colleges with negative student populations:

# What could this possibly mean?
colleges_cleaned.loc[(colleges_cleaned.UNDERGRAD_POP <= 0)]
COLLEGE_ID NAME UNDERGRAD_POP PCT_FED_LOAN MEDIAN_GRAD_DEBT SETTING
2702 2154402 Beau Monde College of Hair Design 0.0 0.0962 8973.0 Urban
3107 2188300 Pentecostal Theological Seminary 0.0 0.0000 NaN Urban
4934 489812 Miami-Jacobs Career College-Springboro 0.0 0.9706 24549.5 NaN
5164 469205 Dorsey Business Schools-Farmington Hills 0.0 1.0000 12986.0 Urban
5574 1091302 Minneapolis Media Institute 0.0 0.8706 14438.0 NaN
6155 405740 National American University-Westwood Teach-Ou... 0.0 0.7027 30223.0 Urban

How might a school have an undergraduate population of size zero? Is this a mistake or reality?

The answer may be in how the government defines an undergraduate student. It’s reasonable to assume that certain schools, like beauty schools or theological seminaries, might service populations outside of an undergraduate program. However, this also brings up the question of why there aren’t more colleges with an undergraduate population of zero! More research is necessary to understand the reasonableness of a zero value in this column.

Example: The college NAME attribute contains the name of each college in the dataset. One might expect this column to uniquely represent a college; computing the head of the empirical distribution of college names reveals this is not the case:

colleges_cleaned.NAME.value_counts()
Stevens-Henager College                   7
Columbia College                          5
McCann School of Business & Technology    5
                                         ..
Greensboro College                        1
Austin Kade Academy                       1
University of California-Davis            1
Name: NAME, Length: 6068, dtype: int64

What do these seven ‘Stevens-Henager College’ records represent? Are they:

  • Truly the same college? (e.g. an error in recording)

  • Are they different, but somehow related? (e.g. different branches of the same school)

  • Are they totally unrelated? (e.g. a total coincidence they have the same name).

colleges_cleaned[colleges_cleaned.NAME == 'Stevens-Henager College']
COLLEGE_ID NAME UNDERGRAD_POP PCT_FED_LOAN MEDIAN_GRAD_DEBT SETTING
3384 367400 Stevens-Henager College 205.0 0.8255 27139.0 NaN
3385 367401 Stevens-Henager College 143.0 0.7831 27139.0 Urban
4521 367403 Stevens-Henager College 563.0 0.7481 27139.0 NaN
4833 367405 Stevens-Henager College 72.0 0.7561 27139.0 Urban
5438 367406 Stevens-Henager College 259.0 0.7057 27139.0 Urban
5576 3120302 Stevens-Henager College 111.0 0.8455 25732.0 Rural
5681 367411 Stevens-Henager College 111.0 0.8281 27139.0 Urban

Since the school statistics differ across the seven colleges, the observations are likely not true duplicates. However, they might be somehow related, as the COLLEGE_IDs are close. More research is necessary to answer this question fully. Next steps might include:

  • Retrieving the other attributes from the full college scorecard dataset (e.g. location),

  • Looking up the colleges from another source to match the schools in the table to outside information.

Code Design

The task of data cleaning can be complicated and ad-hoc. The decisions involved in developing data cleaning code often rest on domain-specific judgment calls that are far from obvious to someone attempting to understand the process (including the developers themselves, not long after writing the code). Data scientists should write organized, easy-to-read, easy-to-adapt data cleaning code.

A reasonable approach to writing cleaning code that deals with one column at a time is to:

  1. create a cleaning function for each column,

  2. store the cleaning functions in a dictionary, keyed by the name of the column to be cleaned,

  3. apply the cleaning functions to the columns using DataFrame.agg.

While it may seem overkill to write a separate function for each column, this boilerplate has several advantages over ad-hoc procedural code:

  • The docstring of each cleaning function can contain descriptions of both the cleaning itself as well as assumptions made in the development of the cleaning logic.

  • Organizing logic into separate functions reduces code complexity as columns are added/changed, or as the cleaning code evolves into more sophisticated logic.

  • Applying a dictionary of functions via agg allows the same code structure to work with parallel processing libraries such as Spark and Dask.

Example: Using this approach, the cleaning code for the College Scorecard data looks as follows:

def college_id_cleaner(ser):
    '''returns identifier COLLEGE_ID as an integer type'''
    return ser.astype(int)

def pct_fed_loan_cleaner(ser):
    '''returns PCT_FED_LOAN as a proportion between 0 and 1.'''
    return ser.str.strip('%').astype(float) / 100

def median_grad_debt_cleaner(ser):
    '''returns MEDIAN_GRAD_DEBT as a float (in USD)'''
    return ser.replace('PrivacySuppressed', np.NaN).str.strip('$').astype(float)

def setting_cleaner(ser):
    '''returns SETTING column as a category data-type
    ordered as RURAL < SUBURBAN < CITY < URBAN'''
    return ser.astype(setting_dtype)
csc_cleaning_map = {
    'COLLEGE_ID': college_id_cleaner,
    'PCT_FED_LOAN': pct_fed_loan_cleaner,
    'MEDIAN_GRAD_DEBT': median_grad_debt_cleaner,
    'SETTING': setting_cleaner
}

# columns without a cleaning function are passed through unchanged
csc_cleaning_map = {col: csc_cleaning_map.get(col, lambda ser: ser) for col in colleges.columns}
colleges.agg(csc_cleaning_map)
COLLEGE_ID NAME UNDERGRAD_POP PCT_FED_LOAN MEDIAN_GRAD_DEBT SETTING
0 100200 Alabama A & M University 4824.0 0.7697 32750 Urban
1 105200 University of Alabama at Birmingham 12866.0 0.5207 21833 Urban
2 2503400 Amridge University 322.0 0.8741 22890 Urban
... ... ... ... ... ... ...
6175 4106301 Palm Beach Academy of Health & Beauty-Distinct... 5.0 0.0714 6333 NaN
6176 295600 Piedmont International University 336.0 0.4847 12498 Urban
6177 4250501 National Personal Training Institute-Tampa 32.0 0.2982 NaN NaN

6178 rows × 6 columns