Understanding Assumptions and Data Cleaning

Content Summary

This chapter focuses on assessing a dataset in light of one’s understanding of the process that likely generated the data. That understanding is then translated into code that ‘cleans’ the data into a usable format that faithfully represents the generating process.

The topics needed to understand and clean data are:

Techniques for modifying tables, both to build summaries of a given dataset and to clean it incrementally.
Using data provenance to understand the common ways in which a dataset needs cleaning, whether the issue is data types, systematically incorrect values, or unfaithful data.
Hypothesis testing, which helps one understand a dataset as just one sample of an underlying data-generating process. This helps one assess the quality of the dataset and how well it aligns with what is known about the process that generated it.
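
As a rough illustration of the last point, the sketch below treats an observed dataset as one draw from an assumed process and asks how surprising it is. Everything here is hypothetical: the assumed missingness rate, the row count, and the observed count are made-up numbers, and a simulation with NumPy is only one of several ways such a check could be run.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical assumption about the data generating process:
# about 10% of records are expected to be missing a given field.
assumed_rate = 0.10

# Hypothetical observation: 18 of 100 rows in our dataset are missing it.
n_rows = 100
observed_missing = 18

# Simulate many datasets of the same size under the assumed process.
simulated_missing = rng.binomial(n=n_rows, p=assumed_rate, size=10_000)

# Empirical p-value: the share of simulated datasets with at least as
# many missing values as the one actually observed.
p_value = float(np.mean(simulated_missing >= observed_missing))

print(f"Empirical p-value: {p_value:.4f}")
```

A small p-value suggests the dataset does not look like a typical sample from the assumed process, which is a cue to revisit either the assumption or the way the data was collected and recorded.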

Datasets

The primary dataset in this chapter is the College Scorecard Dataset. This dataset is compiled and published by the US government to provide the public with information about each Title IV college in the United States.

Summary of Library References

Function or Method Name	Description
`assign`	Add a new column to a DataFrame
`str` namespace	String methods applied to each value of a Series
`replace`	Replace a given value
`sample`	Sample observations of a dataset
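
The four entries above can be seen together in a minimal pandas sketch. The table and its column names (`name`, `tuition`) are invented for illustration and are not the actual College Scorecard schema; the placeholder string is likewise treated here as an assumed stand-in for a suppressed value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; not the real College Scorecard columns.
colleges = pd.DataFrame({
    "name": ["  alpha college", "Beta University ", "gamma institute"],
    "tuition": ["12000", "PrivacySuppressed", "8500"],
})

cleaned = (
    colleges
    # `str` namespace: strip stray whitespace and normalize capitalization.
    .assign(name=lambda df: df["name"].str.strip().str.title())
    # `replace`: turn the placeholder string into a missing value,
    # then convert the column from strings to numbers.
    .assign(
        tuition=lambda df: pd.to_numeric(
            df["tuition"].replace("PrivacySuppressed", np.nan)
        )
    )
)

# `sample`: draw a reproducible random subset of rows to inspect.
preview = cleaned.sample(n=2, random_state=0)
print(preview)
```

Because `assign` returns a new DataFrame rather than modifying the original, each cleaning step can be chained and checked incrementally while `colleges` stays untouched.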

Data Science in Practice
