Modifying DataFrames
Contents
Modifying DataFrames¶
Methods for modifying existing Pandas dataframes fall into two categories:
chaining together modified copies of an existing dataframe, and
modifying the data contained in an existing dataframe in-place.
Mutating DataFrames using assign
¶
The assign
method returns a copy of a dataframe, augmented by the column(s) specified in the method. assign
fits into the functional ‘method chaining’ paradigm of table manipulation: easy to understand, free of side-effects, and easily generalized to distributed data processing libraries. However, when used in Pandas, assign
returns a copy of the dataframe, making it a poor choice for manipulating a large dataset on a single processor.
Given an input dataframe df
and N
Series series1, ..., seriesN
:
df.assign(newcol1=series1, ..., newcolN=seriesN)
returns a copy of df
with the N
additional columns labeled as newcol1, ..., newcolN
. The new columns are aligned by their indexes.
Example: Given dataframe of random integers, with columns:
rand_big
containing integers 0-9 inclusive,rand_small
containing integers 0-2 inclusive
rand_df = pd.DataFrame({
'rand_big': np.random.randint(10,size=5),
'rand_small': np.random.randint(3, size=5)
})
rand_df
rand_big | rand_small | |
---|---|---|
0 | 8 | 1 |
1 | 0 | 0 |
2 | 5 | 1 |
3 | 6 | 1 |
4 | 9 | 1 |
To append boolean columns to rand_df
that specifies if a given entry of rand_big
is divisible by 3 and if a given entry of rand_small
is even:
rand_df.assign(
big_is_div3=(rand_df['rand_big'] % 3 == 0),
small_is_even=(rand_df['rand_small'] % 2 == 0)
)
rand_big | rand_small | big_is_div3 | small_is_even | |
---|---|---|---|---|
0 | 8 | 1 | False | False |
1 | 0 | 0 | True | True |
2 | 5 | 1 | False | False |
3 | 6 | 1 | True | False |
4 | 9 | 1 | True | False |
Alternatively, the names of the columns can be passed in the keys of a keyword dictionary. This is useful when the new column names contain spaces or special characters:
new_cols = {
'big_is_div3': rand_df['rand_big'] % 3 == 0,
'small_is_even': rand_df['rand_small'] % 2 == 0
}
rand_df.assign(**new_cols)
rand_big | rand_small | big_is_div3 | small_is_even | |
---|---|---|---|---|
0 | 8 | 1 | False | False |
1 | 0 | 0 | True | True |
2 | 5 | 1 | False | False |
3 | 6 | 1 | True | False |
4 | 9 | 1 | True | False |
When building a new dataframe from derived columns as above, a common pattern is to assign new columns to an empty dataframe:
pd.DataFrame().assign(**new_cols)
big_is_div3 | small_is_even | |
---|---|---|
0 | False | False |
1 | True | True |
2 | False | False |
3 | True | False |
4 | True | False |
Modifying an existing dataframe¶
A Pandas dataframe is a columnar data store where each column is a Series keyed by its column name. Just as columns (and subtables) selected using []
and loc
, they can be also be set/modified by reassigning the data associated to a given key. While this approach is the most common way to modify a dataframe, it usually isn’t the best way to modify them; using assign
is the better choice when possible.
Given an input dataframe df
and N
Series series1, ..., seriesN
:
df['newcol1'] = series1
...
df['newcolN'] = seriesN
modifies df
by adding N
additional columns labeled as newcol1, ..., newcolN
. The new columns are aligned by their indexes.
Remark: Modifying a dataframe via column/index reassignment is the most time/space efficient procedure on a single processor. However it has many disadvantages worth understanding. Reassignment
encourages procedural code that is hard to maintain and extend (unlike method chaining),
often has side-effects on existing dataframes that cause hard-to-find bugs. This is especially true when working in Jupyter notebooks, as re-running previous code on a modified dataframe may lead to unintended output.
obfuscates the data processing logic and makes it harder to translate the code into a distributed environment.
Example: To append boolean columns to rand_df
that specifies if a given entry of rand_big
is divisible by 3 and if a given entry of rand_small
is even:
rand_df['big_is_div3'] = rand_df['rand_big'] % 3 == 0
rand_df['small_is_even'] = rand_df['rand_small'] % 2 == 0
rand_df
rand_big | rand_small | big_is_div3 | small_is_even | |
---|---|---|---|---|
0 | 8 | 1 | False | False |
1 | 0 | 0 | True | True |
2 | 5 | 1 | False | False |
3 | 6 | 1 | True | False |
4 | 9 | 1 | True | False |
Modifying subtables with loc
¶
One can also (re-)assign columns using loc
. Given an input dataframe df
and N
Series series1, ..., seriesN
:
df.loc[:, 'newcol1'] = series1
...
df.loc[:, 'newcolN'] = seriesN
modifies df
by adding N
additional columns labeled as newcol1, ..., newcolN
. The new columns are aligned by their indexes.
Example: To append boolean columns to rand_df
that specifies if a given entry of rand_big
is divisible by 3 and if a given entry of rand_small
is even:
rand_df.loc[:, 'big_is_div3'] = rand_df['rand_big'] % 3 == 0
rand_df.loc[:, 'small_is_even'] = rand_df['rand_small'] % 2 == 0
rand_df
rand_big | rand_small | big_is_div3 | small_is_even | |
---|---|---|---|---|
0 | 8 | 1 | False | False |
1 | 0 | 0 | True | True |
2 | 5 | 1 | False | False |
3 | 6 | 1 | True | False |
4 | 9 | 1 | True | False |
Remark: Note, this code actually reassigns the two new columns, as we modifying the original dataframe in the previous example!
Modify tables with loc
actually allows to modify subsets of rows as well. Generally, just as one can selecting general subtables using loc
and boolean arrays, one can also reassign those selected values. For example, to add a column small_number_type
that labels each value of rand_small
as either Even
or Odd
:
# copy rand_small data to a new column
rand_df.loc[:, 'small_number_type'] = rand_df['rand_small']
# boolean arrays for selecting even/odd entries
evens_bool = rand_df['rand_small'] % 2 == 0
odds_bool = rand_df['rand_small'] % 2 == 1
# reassign even/odd entries to the values 'Even'/'Odd'
rand_df.loc[evens_bool, 'small_number_type'] = 'Even'
rand_df.loc[odds_bool, 'small_number_type'] = 'Odd'
rand_df
rand_big | rand_small | big_is_div3 | small_is_even | small_number_type | |
---|---|---|---|---|---|
0 | 8 | 1 | False | False | Odd |
1 | 0 | 0 | True | True | Even |
2 | 5 | 1 | False | False | Odd |
3 | 6 | 1 | True | False | Odd |
4 | 9 | 1 | True | False | Odd |
Warning: chained assignment¶
Modifying a dataframe through reassignment sometimes leads to unexpected results, especially when part of a long data cleaning process. The root of these ambiguities come from whether certain Pandas operations return copy or a reference to the underlying object being modified. This behavior is context dependent (e.g. it may depend on the data-types present in the dataframe.
Unintentionally setting new values on a copy dataframe, instead of a reference, leads to unexpected results. This most commonly happens when performing chained assignment – reassigning values to a dataframe created from multiple indexing operations.
Pandas will return a SettingWithCopyWarning
when attempting to reassign the values of a dataframe that is already a copy of a dataframe. Such code is ambiguous: is the reassignment intended for solely the copy or the original dataframe as well? Regardless, such code is unclear and assign
should be used if possible.
Example: The different chained assignments of the table numbers
illustrate the dangers of chained assignment.
numbers = pd.DataFrame([['even', 2, 4, 6], ['odd', 1, 3, 5]], columns=['type', 'A', 'B', 'C'])
numbers
type | A | B | C | |
---|---|---|---|---|
0 | even | 2 | 4 | 6 |
1 | odd | 1 | 3 | 5 |
The first .loc
returns a copy; the reassignment is performed on a copy that is not saved to a variable. This results in the value remaining unchanged.
numbers_modify1 = numbers.copy()
numbers_modify1.loc[1].loc['B'] = 1
numbers_modify1
type | A | B | C | |
---|---|---|---|---|
0 | even | 2 | 4 | 6 |
1 | odd | 1 | 3 | 5 |
Removing the chained assignment returns the expected result: the 3 is changed to a 1.
numbers_modify3 = numbers.copy()
numbers_modify3.loc[1, 'B'] = 1
numbers_modify3
type | A | B | C | |
---|---|---|---|---|
0 | even | 2 | 4 | 6 |
1 | odd | 1 | 1 | 5 |
For performance reasons, Pandas often returns a reference instead of a copy when the underlying DataFrame has homogeneous data-type. This leads to different results from the previous illustration! After dropping the object
type column in numbers
, chained assignment manages to give to the desired result.
without_type = numbers.drop('type', axis=1) # drop returns a copy!
without_type.loc[1].loc['B'] = 1
without_type
A | B | C | |
---|---|---|---|
0 | 2 | 4 | 6 |
1 | 1 | 1 | 5 |
Example: Chained assignment usually appears when reassignment to on a subtable created through more complicated processing. Selecting a subtable creates a copy; modifying that subtable doesn’t affect the original table.
firstrow = numbers.loc[numbers['C'] == 6]
firstrow
type | A | B | C | |
---|---|---|---|---|
0 | even | 2 | 4 | 6 |
firstrow.loc[0, 'A'] = 100
firstrow
type | A | B | C | |
---|---|---|---|---|
0 | even | 100 | 4 | 6 |
numbers
type | A | B | C | |
---|---|---|---|---|
0 | even | 2 | 4 | 6 |
1 | odd | 1 | 3 | 5 |