# Understanding Aggregations

Data aggregation is a powerful tool that allows one to understand a population via summarization; changing the granularity of data from individual observations to groups of observations can make patterns suddenly clear.

• Computing the distribution of a variable is one example.

• Computing descriptions of groups, labeled by a shared attribute, instead of descriptions of individuals, is another.

## Reshaping Data: Pivot Tables

Pivot tables are compact ways of summarizing groups within data. The “pivot” refers to reshaping: a long table of rows indexed by two characteristics becomes a wide table with one characteristic along each axis.

As a simple example, consider a table of aggregate student records resulting from a groupby operation; each record contains the average GPA of one year’s student body within a given department:

| Year | Course Dept | Avg Grade |
|-----------|-------------|-----------|
| Senior | CS | 3.3 |
| Senior | Math | 2.9 |
| Senior | DS | 3.1 |
| Junior | CS | 3.4 |
| Junior | Math | 3.0 |
| Junior | DS | 3.0 |
| Sophomore | CS | 3.0 |
| Sophomore | Math | 3.1 |
| Sophomore | DS | 2.9 |
| Freshman | CS | 3.2 |
| Freshman | Math | 3.2 |
| Freshman | DS | 3.2 |

While this table is easy to work with, drawing quick comparisons across attributes is cumbersome. A pivot table created from these averages reshapes the long table into a wider, more compact format. In this case:

• the index of the pivot table contains the unique values of ‘Course Dept’,

• the columns of the pivot table contain the unique values of ‘Year’,

• the values of the pivot table contain the values of ‘Avg Grade’.

The pivot table:

| Course Dept | Freshman | Sophomore | Junior | Senior |
|-------------|----------|-----------|--------|--------|
| CS | 3.2 | 3.0 | 3.4 | 3.3 |
| Math | 3.2 | 3.1 | 3.0 | 2.9 |
| DS | 3.2 | 2.9 | 3.0 | 3.1 |

In Pandas, such a reshaping is done with the pivot method.
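As a sketch, the long table above can be rebuilt as a DataFrame and pivoted (the variable names here are illustrative, not from the original dataset):

```python
import pandas as pd

# Long table of average GPAs, transcribed from the table above
gpas = pd.DataFrame({
    'Year': ['Senior'] * 3 + ['Junior'] * 3 + ['Sophomore'] * 3 + ['Freshman'] * 3,
    'Course Dept': ['CS', 'Math', 'DS'] * 4,
    'Avg Grade': [3.3, 2.9, 3.1, 3.4, 3.0, 3.0, 3.0, 3.1, 2.9, 3.2, 3.2, 3.2],
})

# Reshape: one row per department, one column per year
wide = gpas.pivot(index='Course Dept', columns='Year', values='Avg Grade')
```

Each of the 12 entries of wide comes from exactly one row of gpas; no aggregation takes place.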

Remarks:

• In the pivot table one can easily spot the trend that, as students progress, math classes get harder, while CS classes get easier.

• The ‘Year’ values were set as the columns of the pivot table, as there are only 4 distinct values; there may be many departments.

• The pivot is merely a reshaping of the original table; the number of entries in the pivot is the same as the number of rows in the original table.

Example: Recall the dataset consisting of attempts to summit Mt. Rainier. Even among just the 10 most popular climbing routes, listing the number of climbers who attempted each route on every given day results in a long table with 895 rows. Such a table is hard to manually inspect!

climbers_by_route_day = popular_climbs.groupby(['Date', 'Route'])['Attempted'].sum().reset_index()
climbers_by_route_day

Date Route Attempted
0 1/16/2014 Little Tahoma 2
1 1/18/2014 Gibraltar Ledges 8
2 1/19/2014 Gibraltar Ledges 2
... ... ... ...
892 9/8/2015 Disappointment Cleaver 2
893 9/9/2014 Disappointment Cleaver 29
894 9/9/2015 Disappointment Cleaver 16

895 rows × 3 columns

The pivot table presents the attempts by day and route as a matrix:

climbers_by_route_day.pivot(index='Date', columns='Route', values='Attempted')

Route Disappointment Cleaver Emmons-Winthrop Fuhrer's Finger Gibraltar Ledges Ingraham Direct Kautz Glacier Liberty Ridge Little Tahoma Ptarmigan Ridge
Date
1/16/2014 NaN NaN NaN NaN NaN NaN NaN 2.0 NaN
1/18/2014 NaN NaN NaN 8.0 NaN NaN NaN NaN NaN
1/19/2014 NaN NaN NaN 2.0 NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
9/8/2015 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
9/9/2014 29.0 NaN NaN NaN NaN NaN NaN NaN NaN
9/9/2015 16.0 NaN NaN NaN NaN NaN NaN NaN NaN

386 rows × 9 columns

In this pivot table, an entry with a NaN value corresponds to a day on which no one attempted the given climbing route; filling in the NaN entries with zero makes sense in this case (though this is not always appropriate!).
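Filling those entries is a one-liner with fillna; here is a minimal sketch on a hypothetical stand-in for two rows of the pivoted table above:

```python
import pandas as pd

# Stand-in for two rows of the pivoted attempts table (values from the output above)
attempts = pd.DataFrame(
    {'Gibraltar Ledges': [None, 8.0], 'Little Tahoma': [2.0, None]},
    index=pd.Index(['1/16/2014', '1/18/2014'], name='Date'),
)

# A missing entry means "no attempts on that route that day", so zero is a faithful fill
filled = attempts.fillna(0)
```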

The pattern of performing a groupby operation followed by a reshaping operation is so common that Pandas provides a method combining the two.

The pivot_table DataFrame method is like pivot with an additional keyword argument aggfunc, which applies the provided function to the groups defined by the index and columns arguments:

popular_climbs.pivot_table(
    index='Date',
    columns='Route',
    values='Attempted',
    aggfunc='sum',
    fill_value=0
)

Route Disappointment Cleaver Emmons-Winthrop Fuhrer's Finger Gibraltar Ledges Ingraham Direct Kautz Glacier Liberty Ridge Little Tahoma Ptarmigan Ridge
Date
1/16/2014 0 0 0 0 0 0 0 2 0
1/18/2014 0 0 0 8 0 0 0 0 0
1/19/2014 0 0 0 2 0 0 0 0 0
... ... ... ... ... ... ... ... ... ...
9/8/2015 2 0 0 0 0 0 0 0 0
9/9/2014 29 0 0 0 0 0 0 0 0
9/9/2015 16 0 0 0 0 0 0 0 0

386 rows × 9 columns

Remark: pivot_table is equivalent to pivot when the aggregation function is the identity function (aggfunc=lambda x: x).
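The combined behavior is easy to verify on a toy table (hypothetical data, not the climbing dataset): grouping and summing, then pivoting, agrees with a single pivot_table call.

```python
import pandas as pd

# Toy long table with a repeated (Date, Route) pair, so aggregation matters
df = pd.DataFrame({
    'Date': ['d1', 'd1', 'd2', 'd2'],
    'Route': ['A', 'B', 'A', 'A'],
    'Attempted': [2, 3, 1, 4],
})

# groupby + sum, then reshape with pivot
via_groupby = (
    df.groupby(['Date', 'Route'])['Attempted'].sum()
    .reset_index()
    .pivot(index='Date', columns='Route', values='Attempted')
    .fillna(0)
)

# the same result in one step
via_pivot_table = df.pivot_table(
    index='Date', columns='Route', values='Attempted',
    aggfunc='sum', fill_value=0,
)

# entries agree (dtypes may differ, since fillna yields floats)
assert (via_groupby == via_pivot_table).all().all()
```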

### Conditional distributions are pivot tables

Example: To understand where and when most individual climbers are on Mt. Rainier, compute

1. the distribution of routes that climbers use to attempt to summit the mountain,

2. the distribution of months of the year when climbers attempt to summit the mountain,

3. the conditional distribution of routes climbed by month of the year.

Note that this is not asking for the distribution of routes used by climbing parties (each party corresponds to a row of the dataframe), but rather the distribution of routes used by individual climbers.

Before calculating the desired distributions, the month of the year must be extracted from the ‘Date’ attribute:

popular_climbs_with_month = popular_climbs.assign(Month=popular_climbs['Date'].str.split('/').str[0].astype(int))


The distribution of routes used by climbers is computed by calculating the total number of climbers that attempted each route up the mountain, then normalizing each sum by the total number of climbers:

climbers_by_route = popular_climbs_with_month.groupby('Route')['Attempted'].sum()
climbers_by_route_distr = climbers_by_route / climbers_by_route.sum()
climbers_by_route_distr

Route
Disappointment Cleaver    0.745837
Emmons-Winthrop           0.148707
Fuhrer's Finger           0.013372
...
Liberty Ridge             0.008082
Little Tahoma             0.014498
Ptarmigan Ridge           0.002792
Name: Attempted, Length: 9, dtype: float64

(
    climbers_by_route_distr
    .sort_values(ascending=False)
    .plot(kind='bar', title='distribution of routes used by individual climbers')
);


Notice that two of the routes on the mountain account for almost 90% of the climbers on the mountain during that time! Similarly, the distribution of months in which individual climbers attempted to summit Mt. Rainier is calculated by:

climbers_by_month = popular_climbs_with_month.groupby('Month')['Attempted'].sum().sort_index()
climbers_by_month_distr = climbers_by_month / climbers_by_month.sum()
climbers_by_month_distr

Month
1     0.005192
2     0.003967
3     0.007494
...
10    0.001567
11    0.000245
12    0.000147
Name: Attempted, Length: 12, dtype: float64

climbers_by_month_distr.plot(kind='bar');


Most climbers attempt to summit Rainier during the summer months of June, July, and August.

To compute the conditional distribution, calculate the total number of climbers that attempted a route in a given month, then normalize by the total number of climbers. This gives the proportion of climbing activity that occurs in a given time and place on the mountain.

First, calculate the sums – recall that pivot_table performs the groupby!

climbers_by_month_route = (
    popular_climbs_with_month
    .pivot_table(index='Month', columns='Route', values='Attempted', aggfunc='sum', fill_value=0)
)

climbers_by_month_route

Route Disappointment Cleaver Emmons-Winthrop Fuhrer's Finger Gibraltar Ledges Ingraham Direct Kautz Glacier Liberty Ridge Little Tahoma Ptarmigan Ridge
Month
1 20 0 5 56 21 0 0 4 0
2 23 0 4 36 16 0 0 2 0
3 60 0 7 19 62 0 0 5 0
... ... ... ... ... ... ... ... ... ...
10 20 0 0 0 0 4 0 8 0
11 5 0 0 0 0 0 0 0 0
12 0 0 0 3 0 0 0 0 0

12 rows × 9 columns

Then normalize by the total number of climbers:

total_climbers = climbers_by_month_route.sum().sum()

climbers_by_month_route_distr = climbers_by_month_route / total_climbers
climbers_by_month_route_distr

Route Disappointment Cleaver Emmons-Winthrop Fuhrer's Finger Gibraltar Ledges Ingraham Direct Kautz Glacier Liberty Ridge Little Tahoma Ptarmigan Ridge
Month
1 0.000980 0.0 0.000245 0.002743 0.001029 0.000000 0.0 0.000196 0.0
2 0.001127 0.0 0.000196 0.001763 0.000784 0.000000 0.0 0.000098 0.0
3 0.002939 0.0 0.000343 0.000931 0.003037 0.000000 0.0 0.000245 0.0
... ... ... ... ... ... ... ... ... ...
10 0.000980 0.0 0.000000 0.000000 0.000000 0.000196 0.0 0.000392 0.0
11 0.000245 0.0 0.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.0
12 0.000000 0.0 0.000000 0.000147 0.000000 0.000000 0.0 0.000000 0.0

12 rows × 9 columns

The table representing the conditional distribution is quite large; there are 108 month/route combinations, most of which represent very few climbers. Looking at only the most popular routes, during the most popular months:

climbers_by_month_route_distr.loc[
    [5, 6, 7, 8, 9],  # summer months
    ['Disappointment Cleaver', 'Emmons-Winthrop', 'Kautz Glacier']  # top three most popular routes
]

Route Disappointment Cleaver Emmons-Winthrop Kautz Glacier
Month
5 0.075823 0.004065 0.001322
6 0.182259 0.048491 0.017535
7 0.216742 0.070337 0.018417
8 0.175647 0.025078 0.007886
9 0.084542 0.000735 0.000245

Notice that ~20% of the climbing on Mt. Rainier occurs during a single month of the year, on a single route!

Remark: One can recover the distributions of routes and months (i.e. the marginal distributions) by summing the table over months and over routes, respectively.

# Distribution of Routes
climbers_by_month_route_distr.sum()

Route
Disappointment Cleaver    0.745837
Emmons-Winthrop           0.148707
Fuhrer's Finger           0.013372
...
Liberty Ridge             0.008082
Little Tahoma             0.014498
Ptarmigan Ridge           0.002792
Length: 9, dtype: float64

# Distribution of Months
climbers_by_month_route_distr.sum(axis=1)

Month
1     0.005192
2     0.003967
3     0.007494
...
10    0.001567
11    0.000245
12    0.000147
Length: 12, dtype: float64


### Undoing a pivot

While statistics of datasets are often more readable in a pivot table, a longer table with well-defined observations and attributes is easier to compute with. If a table is stored as a pivot table, one can undo the pivot using the melt method.

Note that melt only undoes the reshaping operation of pivot – in general, the groupby aggregations cannot be inverted!

climbers_by_month_route.reset_index().melt(id_vars='Month', value_name='Total Attempts')

Month Route Total Attempts
0 1 Disappointment Cleaver 20
1 2 Disappointment Cleaver 23
2 3 Disappointment Cleaver 60
... ... ... ...
105 10 Ptarmigan Ridge 0
106 11 Ptarmigan Ridge 0
107 12 Ptarmigan Ridge 0

108 rows × 3 columns
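A round trip on a small hypothetical table makes the point concrete: pivot followed by melt recovers the long table (up to row order), while any aggregation performed beforehand stays lost.

```python
import pandas as pd

# Hypothetical long table with one row per (Month, Route) pair
long = pd.DataFrame({
    'Month': [1, 1, 2, 2],
    'Route': ['A', 'B', 'A', 'B'],
    'Total Attempts': [20, 56, 23, 36],
})

wide = long.pivot(index='Month', columns='Route', values='Total Attempts')

# melt names the variable column after wide.columns.name ('Route')
recovered = (
    wide.reset_index()
    .melt(id_vars='Month', value_name='Total Attempts')
    .sort_values(['Route', 'Month'])
    .reset_index(drop=True)
)
```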

## Reversing aggregations: ecological fallacies

Grouping “like” individuals together and reducing them to a few numbers, like a median or mean, makes understanding a large population a tractable problem. However, this process necessarily loses information about the underlying population. When working with aggregate data, it is of the utmost importance to keep clear which population is being described; confusing individuals for groups of individuals often leads to incorrect conclusions!

An ecological fallacy refers to drawing conclusions about individuals from conclusions made about aggregate descriptions of the data. Such conclusions often fail for two complementary reasons:

• An aggregate view of the data obscures the relationship between the groups. For example, viewing a population through a group-wise mean masks the relative sizes of the groups.

• An aggregate view of the data hides the richness of the intra-group population. For example, viewing a population through a group-wise mean conceals the shape of the distribution within each group (e.g. the existence of outliers, skewness, etc.).

Example: Suppose a group of people have higher incomes on average than the population as a whole. This does not imply that a randomly chosen member of the group is more likely to make an above-average wage. As an extreme example, consider a group of 5 individuals: 4 of them make $0 per year and the other makes $1M per year. The group average is $200K per year, which is above the average income in the population. However, 80% of the time, a randomly chosen individual from the group makes below the average income of the population, since they make $0 per year. Drawing a conclusion about the individuals in a group is more closely related to the median of the group’s distribution than to its mean; however, this quantity is not contained in the aggregate description.
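A two-line check of the arithmetic in this example:

```python
# Incomes of the five-person group described above
incomes = [0, 0, 0, 0, 1_000_000]

group_mean = sum(incomes) / len(incomes)           # the aggregate description
group_median = sorted(incomes)[len(incomes) // 2]  # what a typical member makes
```

The mean ($200K) says nothing about the typical member; the median ($0) does, but it is not part of the aggregate description.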

Simpson’s Paradox is less a paradox and more a simple arithmetic phenomenon: group-wise averages can look drastically different depending on how finely the data are grouped. The phenomenon seems paradoxical only because of the tendency to make inferences about individuals from these group-wise summaries: the grouped data can describe an individual in two seemingly incompatible ways.

Simpson’s Paradox occurs in a dataset when aggregate trends reverse upon further partitioning of the data. The simplest examples concern average rates of occurrence of some event:

Example: A medical study in BMJ examined the relative success rates of two different procedures for removing kidney stones from patients. The first procedure removes the stone via traditional “open surgery” (OS), whereas the second removes the stone through a small puncture (PN). Among 700 patients treated, the success rate of PN was higher than the success rate of OS:

| OS | PN |
|----|----|
| 78% (273/350) | 83% (289/350) |

However, OS had a higher success rate when used on patients with smaller kidney stones, as well as on patients with larger kidney stones:

| | OS | PN |
|--|----|----|
| Small Stones | 93% (81/87) | 87% (234/270) |
| Large Stones | 73% (192/263) | 69% (55/80) |
| Both | 78% (273/350) | 83% (289/350) |

Which treatment is more effective? The two pictures contradict each other!

Observing the raw numbers in the study, one notices that most of the procedures fall into only two of the four cases:

• Open Surgery (OS) on large stones, and

• Puncture procedure (PN) on small stones.

That is to say, treatment OS was mostly evaluated on large kidney stones, whereas treatment PN was largely evaluated on small kidney stones. This comparison is not a fair one, as removing a large kidney stone is a more difficult operation than removing a small one. As such, it seems that Open Surgery may actually be the more effective procedure.
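The reversal is plain arithmetic; recomputing the rates from the counts reported in the study:

```python
# (successes, attempts) per treatment and stone size, from the tables above
counts = {
    'OS': {'small': (81, 87), 'large': (192, 263)},
    'PN': {'small': (234, 270), 'large': (55, 80)},
}

# success rate within each (treatment, stone size) group
rates = {
    treatment: {size: s / n for size, (s, n) in groups.items()}
    for treatment, groups in counts.items()
}

# overall success rate per treatment, pooling both stone sizes
overall = {
    treatment: sum(s for s, _ in groups.values()) / sum(n for _, n in groups.values())
    for treatment, groups in counts.items()
}

# OS wins within each stone size...
assert rates['OS']['small'] > rates['PN']['small']
assert rates['OS']['large'] > rates['PN']['large']
# ...yet PN wins overall, because OS was mostly used on the harder, large stones
assert overall['PN'] > overall['OS']
```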

However, randomly assigning patients to the two treatments is the best way to truly determine effectiveness, as there may be other attributes with a similar confounding effect that would reverse the conclusion yet again!

Example: Suppose a mountain climber with ambitions of climbing Mt. Rainier is trying to decide both

1. which route to climb to attempt to summit the mountain and

2. whether to attempt the climb in the summer or the winter.

Being quantitatively inclined, she turns to the dataset of past climbing attempts. She would like to climb either the ‘Emmons-Winthrop’ route (the most straightforward route, up the mountain’s largest glacier) or the ‘Gibraltar Ledges’ route (the route used in the first known successful summit of Rainier).

Computing the success rate of mountaineers on each route, she notes that people have more success summiting via the Emmons-Winthrop route:

routes = ['Emmons-Winthrop', 'Gibraltar Ledges']

(
    popular_climbs
    .loc[popular_climbs['Route'].isin(routes)]
    .groupby('Route')
    [['Attempted', 'Succeeded']]
    .sum()
    .apply(lambda x: x['Succeeded'] / x['Attempted'], axis=1)
    .rename('Success Rate')
    .to_frame()
)

Success Rate
Route
Emmons-Winthrop 0.549736

Next, she would like to know when to climb the route, so she computes the success rates of each route when summiting during summer vs winter:

• compute a column specifying the season (e.g. summer is between June and October inclusive; winter is everything else),

• compute the rates by season and route

summer_or_winter = (
    popular_climbs_with_month['Month']
    .apply(lambda x: 'Winter' if (x > 10 or x < 6) else 'Summer')
)

ew_gl_cnts = (
    popular_climbs
    .assign(Season=summer_or_winter)
    .loc[popular_climbs['Route'].isin(routes)]
    .pivot_table(
        index='Season',
        columns='Route',
        values=['Attempted', 'Succeeded'],
        aggfunc='sum')
)

ew_gl_cnts

Attempted Succeeded
Route Emmons-Winthrop Emmons-Winthrop
Season
Summer 2953 1649
Winter 83 20

ew_gl_success_rates = ew_gl_cnts['Succeeded'] / ew_gl_cnts['Attempted']
ew_gl_success_rates

Route Emmons-Winthrop
Season
Summer 0.558415
Winter 0.240964

Breaking down the success rates by season reveals a reversal: counter to the previous comparison, Gibraltar Ledges has a higher success rate in both the winter and the summer! But which route should she climb? And when?

A peek at the underlying raw numbers of attempts is illuminating:

ew_gl_cnts['Succeeded']

Route Emmons-Winthrop
Season
Summer 1649
Winter 20

These counts reveal two observations that are hidden in the tables of success rates:

• The number of climbers that attempted Gibraltar Ledges is perhaps too small to trust! Perhaps the six out of eight climbers who successfully reached the summit were merely lucky?

• The ‘Emmons-Winthrop’ route is primarily attempted in summer, whereas the ‘Gibraltar Ledges’ route is primarily attempted in winter.

Further background research reveals that climbing in the winter is typically a more demanding, dangerous affair, with unsettled weather and shorter days; this suggests that the population of climbers attempting routes in the winter may be more skilled than those attempting to summit in the summer. Thus, most climbers on Gibraltar Ledges are stronger climbers who are more likely to summit successfully.

Our data scientist, recognizing herself as being new to the endeavor, realizes that climbing the Emmons-Winthrop route is her best bet for success.

Remark: Don’t be fooled by the 75% success rate! For both of the reasons outlined above, attempting the Gibraltar Ledges route in the summer is a terrible idea: “Due to loose rock, this route should only be attempted in cold conditions.” (See the route description for more.)