# Data Granularity

## From questions to code

While determining what measurements to make to understand a process is clearly important, the difficulty of understanding *what* or *who* to measure shouldn’t be overlooked. One needs to be clear about both the individuals being observed and the attributes being measured.

**Example:** Suppose a Data Scientist working at a video streaming platform studies users’ viewing behavior. The data she uses might come in two possible forms:

1. Each row represents an event: a user streaming a particular video. Attributes include the user streaming the video, the name of the video, the length of the streaming event, whether the video was watched to completion, etc.

2. Each row represents a user. Attributes include the average and total time spent streaming, how often a video was watched to completion, etc.

These datasets have different *data designs*; each is better suited to answer certain questions than the other. However, the user-level data is fundamentally an aggregate view of the event-level data: it gives a *coarser* view of the data. Below are a few differences between the two data designs.

|   | Event Level (fine) | User Level (coarse) |
|---|---|---|
| Answering questions about: | cost of running servers | net revenue per user |
| Cost implications of data collection: | large/expensive to store | smaller/cheaper to store |
| Privacy implications of data collection: | possibly detailed personal data | more easily anonymized |
| Flexibility of use: | usable for a broad range of questions | drops more information about underlying events |
| … | … | … |

Suppose, for example, the Data Scientist notices when looking at the user-level data that 5% of paying users account for 95% of all video watched on the platform. A knee-jerk response might be to discourage these users from such heavy use, as this small number of paying subscribers accounts for most of the costs of running the platform. However, a look at the event-level data might reveal that these users all have something in common: they watch the new live-streaming channels that the company is promoting, which consist of inordinately long events. In fact, the root of the problem lies in the service being promoted, not the users.

Data can be represented at different levels of granularity, and it’s useful to move from a fine-grained picture to coarser summaries while understanding what information is lost. An astute Data Scientist treats coarse-grained data with skepticism, without losing the ability to use it effectively.
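To make the coarsening concrete, here is a small sketch that aggregates a hypothetical event-level table into a user-level view (the column names `user`, `video`, `minutes`, and `completed` are invented for illustration):

```python
import pandas as pd

# Hypothetical event-level table: one row per streaming event.
events = pd.DataFrame({
    'user': ['u1', 'u1', 'u2', 'u2', 'u2'],
    'video': ['v1', 'v2', 'v1', 'v3', 'v2'],
    'minutes': [42, 17, 60, 5, 30],
    'completed': [True, False, True, False, True],
})

# Coarsen to a user-level view: one row per user.
users = events.groupby('user').agg({
    'minutes': ['sum', 'mean'],   # total and average time streamed
    'completed': 'mean',          # fraction of videos watched to completion
})
```

The user-level table can always be derived from the event-level table in this way; the reverse is impossible, which is precisely the information loss described above.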

## Group-wise operations: split-apply-combine

Moving to coarser views of a dataset involves *grouping individuals into subpopulations* and aggregating the measurements of each subpopulation into a table of summaries. This process follows the logic of split-apply-combine, nicely summarized in the context of data analysis by Hadley Wickham. This data-processing strategy is useful across a very broad class of computational tasks, allowing one to:

* cleanly decompose processing logic into easy-to-understand, discrete steps, and
* scale a solution, using parallel processing, to large datasets.

The components of split-apply-combine are as follows:

* **Split** the data into groups based on some criteria (often encoded in a column).
* **Apply** function(s) to each group independently.
* **Combine** the results into a data structure (e.g. a Series or DataFrame).

For example, given a table with group labels A, B, C in a column `Col1`, the sum of values in each group is calculated by splitting the table into one table per group, summing the values in each group’s table separately, and combining the resulting sums into a new, smaller table (see the figure below).

This abstraction allows one to develop transformations on a single table without worrying about the particulars of how it’s applied across groups and returned as an aggregated result.
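The three steps can be made explicit in code; a minimal sketch on a toy table with group labels in `Col1`:

```python
import pandas as pd

df = pd.DataFrame({'Col1': 'A B C C B B A'.split(),
                   'Col2': [1, 2, 3, 4, 2, 5, 3]})

# Split: one sub-table per group label in Col1.
groups = {label: sub for label, sub in df.groupby('Col1')}

# Apply: sum the values of each group's table separately.
sums = {label: sub['Col2'].sum() for label, sub in groups.items()}

# Combine: gather the per-group sums into a new, smaller table.
result = pd.Series(sums, name='Col2')

# The built-in one-liner performs the same three steps internally.
same = df.groupby('Col1')['Col2'].sum()
```

In practice one never writes the manual version; the dictionary comprehensions are shown only to expose the three steps that `groupby` hides.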

### Split: the `groupby` DataFrame method

In the context of Pandas DataFrames, the split-apply-combine paradigm is implemented via the `DataFrameGroupBy` class. `DataFrame.groupby(key)` is a DataFrame method that implements the “split” portion of split-apply-combine. The `DataFrameGroupBy` object it returns is a dictionary-like object, keyed on a (list of) column name(s) `key`, with values given by the indices of each group.

**Example:** `df`, defined below, is a DataFrame with group labels contained in `Col1` and integer values in `Col2` and `Col3`. Several useful `GroupBy` properties and methods are illustrated below:

```
df = pd.DataFrame({
    'Col1': 'A B C C B B A'.split(),
    'Col2': [1, 2, 3, 4, 2, 5, 3],
    'Col3': [4, 3, 1, 2, 4, 5, 3]
})
df
```

|   | Col1 | Col2 | Col3 |
|---|---|---|---|
| 0 | A | 1 | 4 |
| 1 | B | 2 | 3 |
| 2 | C | 3 | 1 |
| 3 | C | 4 | 2 |
| 4 | B | 2 | 4 |
| 5 | B | 5 | 5 |
| 6 | A | 3 | 3 |

The `GroupBy` object does not have a descriptive string representation:

```
G = df.groupby('Col1')
G
```

```
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11ff2ae80>
```

However, the `groups` property returns a dictionary keyed by group label, with the indices of `df` corresponding to each group label:

```
G.groups
```

```
{'A': Int64Index([0, 6], dtype='int64'),
 'B': Int64Index([1, 4, 5], dtype='int64'),
 'C': Int64Index([2, 3], dtype='int64')}
```

The `G.get_group(label)` method returns the DataFrame `df.loc[df[key] == label]`:

```
G.get_group('A')
```

|   | Col1 | Col2 | Col3 |
|---|---|---|---|
| 0 | A | 1 | 4 |
| 6 | A | 3 | 3 |

If the aggregation only involves one of the columns, selecting that column saves space. The resulting object is a `SeriesGroupBy` object, which is similar to a `DataFrameGroupBy` object:

```
G['Col2']
```

```
<pandas.core.groupby.generic.SeriesGroupBy object at 0x11a64df28>
```

```
G['Col2'].get_group('A')
```

```
0 1
6 3
Name: Col2, dtype: int64
```
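A `GroupBy` object is also iterable: iterating over it yields (label, sub-DataFrame) pairs, one per group. A short sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': 'A B C C B B A'.split(),
    'Col2': [1, 2, 3, 4, 2, 5, 3],
    'Col3': [4, 3, 1, 2, 4, 5, 3]
})

# Each iteration yields a group label and the rows belonging to it,
# in their original row order.
values = {}
for label, sub in df.groupby('Col1'):
    values[label] = sub['Col2'].tolist()
```

Explicit iteration is handy for debugging a group-wise computation before committing it to an `agg` or `apply` call.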

**Example:** In place of a column name, `groupby` also accepts a function to split data into groups:

* the domain of the function is the index of the DataFrame,
* the groups are labeled by the image of the function,
* each group consists of the indices that map to the same value under the function.

For example, the following function creates a `GroupBy` object with three groups: Small, Medium, and Large. (Note that `x == 3` satisfies neither of the first two conditions, so it falls through to Large.)

```
def grouper(x):
    if x <= 2:
        return 'Small'
    elif 3 < x <= 4:
        return 'Medium'
    else:
        return 'Large'
```

```
(
    df
    .set_index('Col2')
    .groupby(grouper)
    .groups
)
```

```
{'Large': Int64Index([3, 5, 3], dtype='int64', name='Col2'),
 'Medium': Int64Index([4], dtype='int64', name='Col2'),
 'Small': Int64Index([1, 2, 2], dtype='int64', name='Col2')}
```
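The same grouping can be expressed without moving `Col2` into the index: `groupby` also accepts a Series of labels aligned with the DataFrame’s index. A sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': 'A B C C B B A'.split(),
    'Col2': [1, 2, 3, 4, 2, 5, 3],
    'Col3': [4, 3, 1, 2, 4, 5, 3]
})

def grouper(x):
    if x <= 2:
        return 'Small'
    elif 3 < x <= 4:
        return 'Medium'
    else:
        return 'Large'

# A Series of labels, aligned with df's index, serves as the grouping key.
labels = df['Col2'].apply(grouper)
sizes = df.groupby(labels).size()
```

This form keeps the original index intact, which is convenient when later steps need the row positions.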

### Apply-Combine: `GroupBy` methods

Recall that the ‘apply-combine’ steps consist of applying a transformation to the groups created in the ‘split’ step, followed by combining the separate results into a single data structure. Pandas `GroupBy` methods couple these two steps: while each method implements its own apply logic, all of them combine the transformed groups into a natural output data structure.

Many DataFrame methods also naturally extend to `GroupBy` objects. The Pandas documentation contains an exhaustive list.

**Example:** Calculating the mean of the groups of the simple DataFrame `df` defined in the previous example:

```
df
```

|   | Col1 | Col2 | Col3 |
|---|---|---|---|
| 0 | A | 1 | 4 |
| 1 | B | 2 | 3 |
| 2 | C | 3 | 1 |
| 3 | C | 4 | 2 |
| 4 | B | 2 | 4 |
| 5 | B | 5 | 5 |
| 6 | A | 3 | 3 |

```
G = df.groupby('Col1')
G.mean()
```

| Col1 | Col2 | Col3 |
|---|---|---|
| A | 2.0 | 3.5 |
| B | 3.0 | 4.0 |
| C | 3.5 | 1.5 |
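The group-wise mean can be checked against a manual computation on a single group; a quick sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': 'A B C C B B A'.split(),
    'Col2': [1, 2, 3, 4, 2, 5, 3],
    'Col3': [4, 3, 1, 2, 4, 5, 3]
})

means = df.groupby('Col1').mean()

# The 'A' entry of the output is the mean over just the 'A' rows.
manual = df.loc[df['Col1'] == 'A', 'Col2'].mean()   # (1 + 3) / 2
```

Spot-checking one group like this is a cheap sanity check for any group-wise aggregation.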

**Example:** The table `climbs` contains a list of almost 4000 attempts at climbing to the summit of Mt. Rainier in Washington State during the years 2014-2015. Each row represents a party of climbers attempting to climb a specified *route* up the mountain on a given *date*.

```
climbs
```

|   | Date | Route | Attempted | Succeeded |
|---|---|---|---|---|
| 0 | 11/27/2015 | Disappointment Cleaver | 2 | 0 |
| 1 | 11/21/2015 | Disappointment Cleaver | 3 | 0 |
| 2 | 10/15/2015 | Disappointment Cleaver | 2 | 0 |
| ... | ... | ... | ... | ... |
| 3966 | 1/6/2014 | Disappointment Cleaver | 8 | 0 |
| 3967 | 1/5/2014 | Disappointment Cleaver | 2 | 0 |
| 3968 | 1/4/2014 | Gibraltar Ledges | 3 | 2 |

3969 rows × 4 columns

Most parties are composed of multiple climbers, as evidenced by the ‘Attempted’ column. The total number of people who attempted each route over the two year period is:

```
climbs.groupby('Route')['Attempted'].sum()
```

```
Route
Curtis Ridge 4
Disappointment Cleaver 15227
Edmonds HW 4
...
Tahoma Cleaver 3
Tahoma Glacier 33
Wilson Headwall 5
Name: Attempted, Length: 23, dtype: int64
```

Notice the resulting Series is sorted by the grouping *key*; there are 23 routes, sorted alphabetically.

To calculate the routes with the most climbers on a given day, group the climbs by both date and route name:

```
(
    climbs
    .groupby(['Date', 'Route'])['Attempted']
    .sum()
    .sort_values(ascending=False)
    .reset_index()
)
```

|   | Date | Route | Attempted |
|---|---|---|---|
| 0 | 7/25/2014 | Disappointment Cleaver | 182 |
| 1 | 6/26/2015 | Disappointment Cleaver | 144 |
| 2 | 7/3/2015 | Disappointment Cleaver | 135 |
| ... | ... | ... | ... |
| 943 | 5/28/2015 | Kautz Glacier | 1 |
| 944 | 6/13/2015 | Kautz Glacier | 1 |
| 945 | 8/8/2015 | Emmons-Winthrop | 1 |

946 rows × 3 columns

*Remark:* Grouping on multiple columns results in a multi-indexed Series/DataFrame; using `reset_index` on the output simplifies further processing.
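Before `reset_index`, entries of the multi-indexed result are looked up with tuples. A sketch, using a small, hypothetical stand-in for `climbs` (the values here are invented):

```python
import pandas as pd

# Small stand-in for the `climbs` table (hypothetical values).
climbs = pd.DataFrame({
    'Date': ['7/25/2014', '7/25/2014', '7/26/2014'],
    'Route': ['Disappointment Cleaver', 'Kautz Glacier',
              'Disappointment Cleaver'],
    'Attempted': [10, 3, 7],
})

per_day = climbs.groupby(['Date', 'Route'])['Attempted'].sum()

# The result is indexed by (Date, Route) pairs; look one up as a tuple.
n = per_day.loc[('7/25/2014', 'Disappointment Cleaver')]
```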

Similar to the case of DataFrames, the `GroupBy` method **`agg`** applies columnar Series methods to the groups of a `GroupBy` object. Compare how each of the following outputs relates to the analogous DataFrame method:

* Passing a list of functions applies each function to each column of each group.
* Passing a dictionary of functions applies the functions to the specified column in each group.

**Example:** Using `agg` quickly creates a compact table of descriptive statistics of the attempts up the routes on Mt. Rainier:

```
climbs.groupby('Route').agg(['count', 'sum', 'mean', np.median, 'max'])
```

| Route | Attempted count | Attempted sum | Attempted mean | Attempted median | Attempted max | Succeeded count | Succeeded sum | Succeeded mean | Succeeded median | Succeeded max |
|---|---|---|---|---|---|---|---|---|---|---|
| Curtis Ridge | 2 | 4 | 2.000000 | 2.0 | 2 | 2 | 2 | 1.000000 | 1.0 | 2 |
| Disappointment Cleaver | 2720 | 15227 | 5.598162 | 4.0 | 12 | 2720 | 8246 | 3.031618 | 2.0 | 12 |
| Edmonds HW | 2 | 4 | 2.000000 | 2.0 | 2 | 2 | 0 | 0.000000 | 0.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Tahoma Cleaver | 1 | 3 | 3.000000 | 3.0 | 3 | 1 | 3 | 3.000000 | 3.0 | 3 |
| Tahoma Glacier | 11 | 33 | 3.000000 | 3.0 | 5 | 11 | 11 | 1.000000 | 0.0 | 3 |
| Wilson Headwall | 2 | 5 | 2.500000 | 2.5 | 3 | 2 | 0 | 0.000000 | 0.0 | 0 |

23 rows × 10 columns

The columns of the output are multi-indexed, as each function is applied to both the ‘Attempted’ and ‘Succeeded’ columns of the DataFrame of each route. For example, the output shows that ‘Disappointment Cleaver’:

* was attempted by 2720 climbing parties,
* was attempted by a total of 15227 climbers,
* was successfully summited by a total of 8246 climbers,
* was attempted by parties of, on average, about 5.6 climbers.

Notice that the Date column was dropped from the output; Pandas drops non-numeric columns when any of the aggregation methods requires numeric data. Passing a dictionary of aggregations into `agg` allows for a more tailored set of aggregations:

```
climbs.groupby('Route').agg({'Attempted':['mean', 'max'], 'Date': ['max']})
```

| Route | Attempted mean | Attempted max | Date max |
|---|---|---|---|
| Curtis Ridge | 2.000000 | 2 | 6/12/2015 |
| Disappointment Cleaver | 5.598162 | 12 | 9/9/2015 |
| Edmonds HW | 2.000000 | 2 | 9/14/2014 |
| ... | ... | ... | ... |
| Tahoma Cleaver | 3.000000 | 3 | 6/5/2015 |
| Tahoma Glacier | 3.000000 | 5 | 8/2/2014 |
| Wilson Headwall | 2.500000 | 3 | 4/26/2015 |

23 rows × 3 columns
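Newer versions of Pandas (0.25+) also support *named aggregation*, a third calling convention for `agg` that names the output columns directly and avoids a multi-index. A sketch on a small, hypothetical stand-in for `climbs`:

```python
import pandas as pd

# Small stand-in for the `climbs` table (hypothetical values).
climbs = pd.DataFrame({
    'Date': ['7/25/2014', '7/26/2014', '7/25/2014'],
    'Route': ['Disappointment Cleaver', 'Disappointment Cleaver',
              'Kautz Glacier'],
    'Attempted': [10, 6, 3],
})

# Each keyword names an output column and pairs a source column
# with an aggregation function.
summary = climbs.groupby('Route').agg(
    mean_attempted=('Attempted', 'mean'),
    last_date=('Date', 'max'),
)
```

The flat column names make the result easier to use downstream than the multi-indexed output of the dictionary form.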

### Generic Apply-Combine: the `apply` method

The `apply` method is the most general `GroupBy` method; it applies any function defined on the ‘grouped’ DataFrames and attempts to combine the results into meaningful output. While very flexible, the `apply` method is slower, less space efficient, and more prone to error than the standard `GroupBy` methods; it should be avoided when possible.

Use this method when either:

* the transformation depends on complicated group-wise conditions on multiple columns, or
* the output dimensions of the resulting Series/DataFrame are nonstandard.

**Example:** For the DataFrame `df`, generate a DataFrame indexed by the labels in `Col1`, with five random integers chosen within the range of values of `Col2` within the group. For example, within the `A` group, `Col2` contains 1 and 3; the output row with index `A` should therefore contain five randomly chosen integers between 1 and 3.

```
df
```

|   | Col1 | Col2 | Col3 |
|---|---|---|---|
| 0 | A | 1 | 4 |
| 1 | B | 2 | 3 |
| 2 | C | 3 | 1 |
| 3 | C | 4 | 2 |
| 4 | B | 2 | 4 |
| 5 | B | 5 | 5 |
| 6 | A | 3 | 3 |

Define a function that takes in a group DataFrame (i.e. retrievable by `df.groupby('Col1').get_group(label)`), calculates the minimum and maximum values of `Col2` within that group, and returns the random integers as a Series:

```
def generate_random(gp):
    n_rand = 5
    m, M = gp['Col2'].min(), gp['Col2'].max()
    # note: np.random.randint's upper bound is exclusive
    r = np.random.randint(m, M, size=n_rand)
    return pd.Series(r, index=['rand_%d' % (x + 1) for x in range(n_rand)])
```

The `apply` method combines these Series of random integers into a DataFrame indexed by group label:

```
G.apply(generate_random)
```

| Col1 | rand_1 | rand_2 | rand_3 | rand_4 | rand_5 |
|---|---|---|---|---|---|
| A | 1 | 2 | 2 | 1 | 1 |
| B | 2 | 4 | 2 | 2 | 4 |
| C | 3 | 3 | 3 | 3 | 3 |

Another common use case for `apply` is when the output DataFrame has the same index as the input DataFrame. This often occurs when the elements of a table need to be scaled by a group-wise condition.

**Example (transform):** Scale the DataFrame linearly, so that each entry is represented as a proportion of the range of its group. For example, in ‘Col3’, the ‘B’ group has values 3, 4, 5. The range of the ‘B’ group is (5 - 3) = 2. The value 4 lies in the middle of the range and thus is scaled to 0.5.

```
df
```

|   | Col1 | Col2 | Col3 |
|---|---|---|---|
| 0 | A | 1 | 4 |
| 1 | B | 2 | 3 |
| 2 | C | 3 | 1 |
| 3 | C | 4 | 2 |
| 4 | B | 2 | 4 |
| 5 | B | 5 | 5 |
| 6 | A | 3 | 3 |

```
def scale_by_range(gp):
    rng = gp.max() - gp.min()
    from_min = gp - gp.min()
    return from_min / rng
```

The function `scale_by_range` returns a DataFrame with the same index as its input. Pandas thus aligns the results by common columns and index, returning a DataFrame of the same shape as `df`:

```
G.apply(scale_by_range)
```

|   | Col2 | Col3 |
|---|---|---|
| 0 | 0.0 | 1.0 |
| 1 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 |
| 3 | 1.0 | 1.0 |
| 4 | 0.0 | 0.5 |
| 5 | 1.0 | 1.0 |
| 6 | 1.0 | 0.0 |

Pandas provides a method specifically for this pattern: `transform`. It’s appropriate when the function passed to `apply` returns a Series/DataFrame with the same index as the input. This method is both faster and less error prone than using `apply` in this case.

```
G.transform(scale_by_range)
```

|   | Col2 | Col3 |
|---|---|---|
| 0 | 0.0 | 1.0 |
| 1 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 |
| 3 | 1.0 | 1.0 |
| 4 | 0.0 | 0.5 |
| 5 | 1.0 | 1.0 |
| 6 | 1.0 | 0.0 |
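`transform` also accepts an aggregation by name; in that case the group-wise scalar is broadcast back to every row of its group. A sketch that computes each value’s share of its group total:

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': 'A B C C B B A'.split(),
    'Col2': [1, 2, 3, 4, 2, 5, 3],
    'Col3': [4, 3, 1, 2, 4, 5, 3]
})

# 'sum' is computed per group, then broadcast to every row of the group,
# so group_total has the same index as df.
group_total = df.groupby('Col1')['Col2'].transform('sum')
share = df['Col2'] / group_total   # each value as a share of its group's total
```

Because the result aligns with the original index, it can be assigned directly as a new column of `df`.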

**Example:** Among the 23 routes up Mt. Rainier, only a few see more than 15 attempts in 2014-15. The `apply` method can be used to filter out all attempts on routes that aren’t very popular (fewer than 15 attempts in 2014-15). The split-apply-combine logic:

* splits the table into separate routes on the mountain (split),
* checks the size of each table (apply):
  * leaving the table untouched if it contains at least 15 rows,
  * otherwise, omitting the rows (returning an empty DataFrame),
* combines the (possibly empty) DataFrames into a filtered subtable of the original `climbs`.

```
def filter_few(df, min_size=15):
    '''return df if it has at least `min_size` rows,
    otherwise return an empty dataframe with the same columns'''
    if df.shape[0] >= min_size:
        return df
    else:
        return pd.DataFrame(columns=df.columns)
```

```
popular_climbs = (
    climbs
    .groupby('Route')
    .apply(filter_few)
    .reset_index(level=0, drop=True)  # drop group key: Route
)
popular_climbs.sort_index()
```

|   | Date | Route | Attempted | Succeeded |
|---|---|---|---|---|
| 0 | 11/27/2015 | Disappointment Cleaver | 2 | 0 |
| 1 | 11/21/2015 | Disappointment Cleaver | 3 | 0 |
| 2 | 10/15/2015 | Disappointment Cleaver | 2 | 0 |
| ... | ... | ... | ... | ... |
| 3966 | 1/6/2014 | Disappointment Cleaver | 8 | 0 |
| 3967 | 1/5/2014 | Disappointment Cleaver | 2 | 0 |
| 3968 | 1/4/2014 | Gibraltar Ledges | 3 | 2 |

3914 rows × 4 columns

The resulting table is a filtered subtable of the original table. To verify that all routes contained in `popular_climbs` have at least 15 attempts, we can calculate the distribution of routes:

```
popular_climbs['Route'].value_counts()
```

```
Disappointment Cleaver 2720
Emmons-Winthrop 631
Kautz Glacier 181
...
Liberty Ridge 69
Ingraham Direct 53
Ptarmigan Ridge 22
Name: Route, Length: 9, dtype: int64
```

**Example:** This pattern is common enough that Pandas provides the `filter` method to optimize this computation. `filter` takes a boolean ‘filtering function’ as input and keeps a grouped DataFrame exactly when the function returns `True`. Repeating the prior example, to filter out all but the popular climbing routes:

```
climbs.groupby('Route').filter(lambda df: df.shape[0] >= 15)
```

|   | Date | Route | Attempted | Succeeded |
|---|---|---|---|---|
| 0 | 11/27/2015 | Disappointment Cleaver | 2 | 0 |
| 1 | 11/21/2015 | Disappointment Cleaver | 3 | 0 |
| 2 | 10/15/2015 | Disappointment Cleaver | 2 | 0 |
| ... | ... | ... | ... | ... |
| 3966 | 1/6/2014 | Disappointment Cleaver | 8 | 0 |
| 3967 | 1/5/2014 | Disappointment Cleaver | 2 | 0 |
| 3968 | 1/4/2014 | Gibraltar Ledges | 3 | 2 |

3914 rows × 4 columns

*Remark:* The same filtering is possible via a two-step process that avoids a `groupby`:

1. use the empirical distribution of routes to create an array of ‘popular routes’, then
2. filter the original `climbs` DataFrame using a boolean array created with the `isin` DataFrame method.

Try this! Of course, calculating an empirical categorical distribution on a Series using `value_counts()` is logically equivalent to splitting by Route and taking the size of each group; this approach is more memory efficient, though slower.