Aggregation and Extension of Data

Content Summary

This chapter covers more drastic data manipulation and transformation techniques that improve the usefulness of a dataset. These techniques include:

Grouping data and applying transformations across those groups,
Changing the granularity to a coarser view of the data, while understanding the information lost by applying such a transformation,
Adding new observations to an existing dataset, paying special attention to potential differences in the processes that generated the datasets,
Adding new attributes to existing observations, paying special attention to how an imperfect correspondence may bias the original dataset,
Assessing the differences between populations of a dataset using statistical inference (permutation tests).
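The permutation test named in the last item can be sketched as follows. This is a minimal illustration, not the chapter's own procedure: the function name `permutation_test`, the two samples, and the choice of the absolute difference in means as the test statistic are all assumptions made for the example.

```python
import numpy as np

def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Estimate a p-value for the absolute difference in group means.

    Hypothetical helper written for this illustration only.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())

    # Under the null hypothesis the group labels are exchangeable,
    # so pool the values and reassign them to groups at random.
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(pooled)
        stat = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        if stat >= observed:
            count += 1

    # Fraction of shuffles at least as extreme as the observed statistic
    return observed, count / n_permutations

# Hypothetical samples, made up for illustration only
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.2]
group_b = [6.3, 6.1, 5.9, 6.8, 6.5, 6.7]
observed_diff, p_value = permutation_test(group_a, group_b)
```

A small `p_value` suggests that a difference as large as `observed_diff` would rarely arise if both groups came from the same population.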

The two datasets used in this chapter consist of:

In the lists below, assume that the usual imports have been executed:

import pandas as pd
import numpy as np
import seaborn as sns

Aggregation methods:

Function or Method Name	Description
`groupby`	Split-Apply-Combine processing on tables
`agg`	Apply collections of functions to groups
`transform`	Apply transformations to groups
`apply`	Apply general functions to groups
`filter`	Filter out groups based on conditions
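As a rough sketch of how the aggregation methods above differ, the example below applies each of them to the same grouping. The DataFrame and its column names (`team`, `score`) are hypothetical and made up for illustration.

```python
import pandas as pd

# Hypothetical data, made up for illustration only
df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b", "c"],
    "score": [10, 20, 30, 40, 50, 60],
})

# groupby: split the rows by team and select the column to process
grouped = df.groupby("team")["score"]

# agg: one row per group, one column per aggregating function
summary = grouped.agg(["mean", "max"])

# transform: the result has the same length and index as the input,
# so it can be combined directly with the original column
centered = df["score"] - grouped.transform("mean")

# apply: run an arbitrary function on each group
score_range = grouped.apply(lambda s: s.max() - s.min())

# filter: keep only the rows of groups that satisfy a condition
multi_row = df.groupby("team").filter(lambda g: len(g) >= 2)
```

Note that `agg` and `apply` here return one value per group, while `transform` and `filter` return results at the granularity of the original rows.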

Reshaping methods:

Function or Method Name	Description
`pivot_table`	Reshape (pivot) the entries of a DataFrame
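The following sketch shows one way `pivot_table` moves a table to a coarser granularity. The `sales` DataFrame and its columns are hypothetical, and summing is only one possible choice of aggregation.

```python
import pandas as pd

# Hypothetical data, made up for illustration only:
# one row per individual sale
sales = pd.DataFrame({
    "region": ["east", "east", "east", "west", "west"],
    "quarter": ["q1", "q1", "q2", "q1", "q2"],
    "amount": [100, 50, 200, 300, 400],
})

# One row per region, one column per quarter; sales that share a
# (region, quarter) pair are collapsed into a single summed cell
wide = sales.pivot_table(
    index="region", columns="quarter", values="amount", aggfunc="sum"
)
```

The two `east`/`q1` sales become a single cell in `wide`, so the individual amounts can no longer be recovered from the reshaped table.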

Appending and joining methods:

Function or Method Name	Description
`pd.concat`	Concatenate a list of DataFrames by rows/columns
`merge`	Join two DataFrames by common columns
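A minimal sketch of both operations follows, using hypothetical tables (`batch_1`, `batch_2`, `labels`) made up for illustration. The `indicator=True` option is one way to inspect rows that found no match during a join.

```python
import pandas as pd

# Hypothetical data, made up for illustration only
batch_1 = pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]})
batch_2 = pd.DataFrame({"id": [3, 4], "value": [30.0, 40.0]})

# Add new observations: stack the rows and rebuild the index
observations = pd.concat([batch_1, batch_2], ignore_index=True)

# Add new attributes: a left join keeps every observation,
# including those with no matching row in `labels`
labels = pd.DataFrame({"id": [1, 2, 3], "label": ["x", "y", "z"]})
joined = observations.merge(labels, on="id", how="left", indicator=True)

# Rows marked "left_only" had no match, so their new attribute is missing
unmatched = joined[joined["_merge"] == "left_only"]
```

Checking `unmatched` before dropping or imputing missing attributes helps reveal whether the imperfect correspondence affects some observations more than others.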

Datetime:

Function or Method Name	Description
`pd.to_datetime`	Convert strings to datetime objects
`dt` namespace	Datetime-related properties and methods
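For instance, the sketch below parses a few date strings and reads components through the `dt` namespace. The dates are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical date strings, made up for illustration only
raw = pd.Series(["2021-01-15", "2021-02-28", "2021-12-31"])

# Convert the strings to datetime64 values
dates = pd.to_datetime(raw)

# The dt namespace exposes datetime components element-wise
years = dates.dt.year
months = dates.dt.month
weekdays = dates.dt.day_name()
```

Components extracted this way, such as `months`, can serve as grouping keys for the aggregation methods listed earlier.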

Plotting:

Function or Method Name	Description
`pd.plotting.scatter_matrix`	Plot a scatter matrix
`sns.scatterplot`	Scatter plot with easy customization
`sns.catplot`	Strip and box plots by category