Combining Data: Attributes


Combining different measurements over the same individuals

Adding additional attributes to an existing dataset is often necessary to improve the description of the data generating process and answer the questions being asked of the data. In its most simple form, such a task consists of tables of the same individuals, with different attributes, collected for different purposes.

Such a process simply needs to add the columns of the second table to the first table. However, this process only works when the rows of the two tables correspond to the same individuals. Assuming a one-to-one correspondence between the individuals in the two tables, identified by their index, combining the columns of the two tables requires two steps:

  • sorting each table by their respective indexes (i.e. lining them up),

  • iterating through the (common) index of each table, combining the attributes of each row into a new table.

Remark 1: This procedure is implicitly done by Pandas whenever setting a Series as a new column of an existing DataFrame.

Remark 2: Understanding the internals of this procedure is important for both being able to debug performance issues, as well as implement custom procedures when the correspondence between the rows of the tables is not obvious.

This procedure is known as joining or merging tables (column-wise). The correspondence between the rows (e.g. via a shared index) is called a join key.

Example: Below, two tables contain characteristics of California counties obtained from the US Census: one contains a population attribute, the other contains the median household income. The join key is the ‘County’ column.

Joining data column-wise

Pandas has multiple ways of executing a (column-wise) join of DataFrames. Below outlines each and their relative advantages. In spite of the options, one should use the merge method whenever possible!

Join Technique 1: column(s) assignment The most simple way of joining the columns of two tables along a shared index is by column(s) assignment. If df1 and df2 have indexes in one-to-one correspondence, they can be joined via:

df1[df2.columns] = df2

This technique is simple and space efficient, since column assignment modifies the left-hand DataFrame in-place. However it has many disadvantages:

Advantages

Disadvantages

Space Efficient

Doesn’t work with the method chaining

Joining more than two DataFrames requires multiple lines

Works only when indexes are in one-to-one correspondence

Sorts the two DataFrames by index, even if already aligned

Join Technique 2 (concat): The Pandas function concat column-wise joins a list of DataFrames by a shared index, with several convenient keyword arguments. If df1,...,dfN are DataFrames, then the columns of these DataFrames can be joined by passing them to concat in a list:

pd.concat([df1, ..., dfN], axis=1)

This function has many advantages over column assignment:

Advantages

Disadvantages

Can easily join many DataFrames

Doesn’t work with method chaining

Indexes don’t need to be 1-1 correspondence

The DataFrames must of a common index

Can skip sorting indexes by setting ignore_index=True

Join Technique 3 (merge): The merge method is the most flexible joining technique in Pandas. By default, it joins the columns of two DataFrames on common columns. If df1,...,dfN are DataFrames with shared columns that identify observations between them, the DataFrames can be joined by chaining together successive merge calls:

df1.merge(df2).merge(df3)...merge(dfN)

If the DataFrames have a shared index, then merge is called with the left_index/right_index=True keyword argument.

Advantages

Disadvantages

Works with method chaining

Can’t pass an arbitrary list of DataFrames

Join key doesn’t need to be in index

Can’t skip sorting by the join key

Rows don’t need to be 1-1 correspondence

Example:

counties = pop_counties.merge(inc_counties)
counties
County Population Income
0 Calaveras 44921 54936
1 San Diego 3183143 63996
2 San Joaquin 701050 53253
... ... ... ...
55 Marin 256802 91529
56 Monterey 424927 58582
57 San Mateo 739837 91421

58 rows × 3 columns

Check the relative sizes of the inputs confirm every row has its expected match. The input tables exactly one entry for each of the 58 California counties; the joined table should have one row per county and three columns.

print(pop_counties.shape, inc_counties.shape, counties.shape, sep='\n')
(58, 2)
(58, 2)
(58, 3)

Remark: this simple check can save hours of debugging!

Join types

Not uncommonly, when joining two tables, the observations don’t perfectly align. In these cases, care must be taken with the approach to joining the observations. There are three possibilities:

  1. the observations are in one-to-one correspondence,

  2. one of the tables contains observations not contained in the other,

  3. both of the tables contain observations missing in the other.

In the first case, there is no ambiguity in combining the two tables. However, in the latter two cases, one has to decide whether to retain the observations that are present in only one of the tables. These cases are enumerated in the following join types.

Given two tables, the possible join types of a combined table are:

  • An inner join keeps only the observations with a matched join key,

  • An outer join keeps all observations, even if not-matched,

  • A left join keeps all observations in the left table,

  • A right join keeps all observations in the right table.

Below, the join types are illustrated with Venn diagrams; the diagrams overlap if the tables share an observation: join types

The default operation is typical as inner join, as is true for the merge method.

Example: Suppose the county population table only includes information on counties with populations over 1.5M residents. The inner-join between this partial population table and the full income table will drop income information!

The partial population file only contains seven of the 58 counties:

partial_pop = pop_counties.loc[pop_counties['Population'] > 1.5*10**6]
partial_pop
County Population
1 San Diego 3183143
9 Riverside 2266899
18 Santa Clara 1841569
24 San Bernardino 2078586
37 Los Angeles 9974203
50 Alameda 1559308
53 Orange 3086331

The inner-join only contains the seven large population counties, even though the income table contains all 58 counties:

partial_pop.merge(inc_counties)
County Population Income
0 San Diego 3183143 63996
1 Riverside 2266899 56592
2 Santa Clara 1841569 93854
3 San Bernardino 2078586 54100
4 Los Angeles 9974203 55870
5 Alameda 1559308 73775
6 Orange 3086331 75998

A right-join retains all the income values and adds population entries when available; a NaN value appears when a corresponding county doesn’t appear in the population table:

partial_pop.merge(inc_counties, how='right')
County Population Income
0 San Diego 3183143.0 63996
1 Riverside 2266899.0 56592
2 Santa Clara 1841569.0 93854
... ... ... ...
55 San Benito NaN 67874
56 Yolo NaN 55508
57 Lassen NaN 53351

58 rows × 3 columns

Example: Joins in Pandas are exact matches between join keys, including a match on the data type. For example, columns with integer columns, with one of them typed as a string, will fail to properly join. These mistakes are hard to spot, as Pandas hides the type of the contents of DataFrames on formating.

Defining two dataframes with matching indexes of different types:

df_ints = pd.DataFrame({'B': range(0, 6, 2)}, index=[0, 1, 2])
df_strs = pd.DataFrame({'C': range(0, 12, 4)}, index='0 1 2'.split())

Joining the two DataFrames with an outer-join illustrates the failure to join on index:

merged = df_ints.merge(df_strs, right_index=True, left_index=True, how='outer')
merged
B C
0 0.0 NaN
1 2.0 NaN
2 4.0 NaN
0 NaN 0.0
1 NaN 4.0
2 NaN 8.0

The index of the merged table contains distinct values of both integer and string type:

merged.index
Index([0, 1, 2, '0', '1', '2'], dtype='object')

Example: By default, merge sets the join-key to the set of all common columns between the DataFrames being merged, sometimes leading to unintended consequences when both tables share similar attributes. The join-key can be specified explicitly using the keyword on.

The DataFrames df1 and df2 have different resulting joins when the join key is column ‘A’ vs. columns (‘A’, ‘B’):

df1 = pd.DataFrame({'A': range(3), 'B': range(0, 6, 2)})
df2 = pd.DataFrame({'A': range(3), 'B': range(0, 12, 4)})
df1.merge(df2)
A B
0 0 0
df1.merge(df2, on='A', suffixes=['_1','_2'])
A B_1 B_2
0 0 0 0
1 1 2 4
2 2 4 8

Remark: The suffixes argument decorates the duplicate column names with which table the column originated from.

Example: When joining new attributes to a dataset under investigation, the additional information isn’t always available for the complete population. For example, when investigating the incomes by California county, suppose one attempts to join demographic information (such as population size) to the incomes dataset. However, if this demographic information is only available for counties with a large population, the merge dataset will exhibit significant bias:

large_pop = pop_counties.loc[pop_counties['Population'] > 200000]

Upon merging the incomes with the demographics, an inner-join drops all counties of small size. The resulting average income is quite different than the average income across all counties:

inc_counties.merge(large_pop)['Income'].mean()
63668.57142857143
inc_counties['Income'].mean()
56034.362068965514

By using an left-join, on can avoid dropping these observations (at the expense of introducing missing values into the merged dataset):

inc_counties.merge(large_pop, how='left')['Income'].mean()
56034.362068965514

Many-to-one joins

A join performed on a join-key with duplicate entries results in either a ‘many-to-one’ or a ‘many-to-many’ join. Though these joins occur naturally, the resulting datasets might initially seem unintuitive.

Many-to-one joins occur naturally when using ‘lookup tables’ to attach richer information about an attribute of a dataset (as opposed to an individual). For example, a table might include the purchase history of customers subscribing to a music-streaming service:

Customer

Plan Type

23523

Premium

43453

Premium

34523

Student

98345

Family

13423

Premium

34234

Student

Computing the total revenue generated from these purchases, one needs to merge this table with the three-row table of plan types:

Plan Type

Price

Student

$4.99

Premium

$9.99

Family

$14.99

A few observations on how this join differs from the simple joins covered before:

  • the join-key ‘Plan Type’ is an attribute of the table, not an identifier for the observations of the tables.

  • one expects the resulting joined table to contain the price next to each customer purchase; the each row in the small table is repeated many times over.

The result of joining these two tables is:

Customer

Plan Type

Price

23523

Premium

$9.99

43453

Premium

$9.99

34523

Student

$4.99

98345

Family

$14.99

13423

Premium

$9.99

34234

Student

$4.99

Example: The dataset cities contains population and income information on 1442 California cities, along with the name of the county to which each city belongs. Using this dataset, one can roughly calculate which cities make significantly more or less than their neighbors: create a derived attribute using the ratio of the average household income of a given city to the average household income of the county in which it lies.

cities
City County Population Income
0 Acalanes Ridge Contra Costa 1226 160000.0
1 Acampo San Joaquin 776 141250.0
2 Acton Los Angeles 6956 92245.0
... ... ... ... ...
1439 Yucaipa San Bernardino 52406 58506.0
1440 Yucca Valley San Bernardino 21083 43086.0
1441 Zayante Santa Cruz 853 81308.0

1442 rows × 4 columns

Calculating this attribute requires joining the county dataset with the city dataset on the ‘County’ attribute. Since possibly many cities belong to the same county, this is a many-to-one join.

cities_with_county_info = cities.merge(counties, on='County', suffixes=['_city', '_county'])
cities_with_county_info
City County Population_city Income_city Population_county Income_county
0 Acalanes Ridge Contra Costa 1226 160000.0 1081232 79799
1 Alamo Contra Costa 15639 163151.0 1081232 79799
2 Alhambra Valley Contra Costa 499 62000.0 1081232 79799
... ... ... ... ... ... ...
1435 San Juan Bautista San Benito 2219 57717.0 56888 67874
1436 Tres Pinos San Benito 457 75893.0 56888 67874
1437 San Francisco San Francisco 829072 78378.0 829072 78378

1438 rows × 6 columns

Remark: What happens if the join key were not specified in the merge? What condition must be met for a city to belong to the merged table?

Once merged, the ‘income ratio’ attribute is simply the ratio of two columns. The cities with the highest income ratios are small gated communities within Los Angeles County:

(
    cities_with_county_info.assign(
        income_ratio=cities_with_county_info['Income_city'] / cities_with_county_info['Income_county']
    ).sort_values(by='income_ratio', ascending=False)
)
City County Population_city Income_city Population_county Income_county income_ratio
134 Hidden Hills Los Angeles 1749 245694.0 9974203 55870 4.397602
177 Rolling Hills Los Angeles 1689 218583.0 9974203 55870 3.912350
169 Palos Verdes Estates Los Angeles 13568 171328.0 9974203 55870 3.066547
... ... ... ... ... ... ... ...
595 Boulevard San Diego 434 12421.0 3183143 63996 0.194090
1301 Valley Ford Sonoma 167 9312.0 491790 63799 0.145958
526 Verdi Sierra 92 4853.0 3019 43107 0.112580

1438 rows × 7 columns

Many-to-many joins

Recall that, assuming a one-to-one correspondence between the individuals in the two tables, joining the columns of the two tables requires two steps:

  1. sorting each table by their respective join-keys (i.e. lining them up),

  2. iterating through the (common, corresponding) rows of each table, combining the attributes of each row into a new table.

Now, consider if one, or both, of the tables contains duplicate values in the column specified as the join-key. How does step 2 change? Sketch out an algorithm. How many rows does the resulting joined table have?

Broadcast join

When one of the tables being joined is small, it’s faster to use a broadcast join. Broadcast joins use the following procedure:

  1. load the small table into a dictionary \(D\), keyed by the join-key.

  2. iterate through the large table; for each row \(r\) in the table:

    1. access the join key \(k\) for that row,

    2. get the value \(v = D[k]\) corresponding to this key,

    3. append the value \(v\) from the dictionary to the row \(r\).

This procedure is fast:

  • it doesn’t require sorting either table,

  • dictionary lookups only require constant time,

So this sorting algorithm runs in a time proportional to the sum of the lengths of the two tables.

Example: Given the two tables of streaming music purchases (purchases), and the prices of each plan type (prices), one can use a broadcast join to combine the two tables on ‘Plan Type’:

D = prices.set_index('Plan Type').to_dict()
purchases['Price'] = purchases.apply(lambda x:D.get(x['Plan Type']))

Approximate join