{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove_cell"
]
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"pd.set_option('display.max_rows', 7)\n",
"\n",
"import warnings\n",
"warnings.filterwarnings('ignore')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Modifying DataFrames\n",
"\n",
"Methods for modifying existing Pandas dataframes fall into two categories:\n",
"* chaining together modified copies of an existing dataframe, and\n",
"* modifying the data contained in an existing dataframe in-place.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mutating DataFrames using `assign`\n",
"\n",
"The `assign` method returns a copy of a dataframe, augmented by the column(s) specified in the method. `assign` fits into the functional 'method chaining' paradigm of table manipulation: easy to understand, free of side-effects, and easily generalized to distributed data processing libraries. However, when used in Pandas, `assign` returns a copy of the dataframe, making it a poor choice for manipulating a large dataset on a single processor.\n",
"\n",
"Given an input dataframe `df` and `N` Series `series1, ..., seriesN`:\n",
"```\n",
"df.assign(newcol1=series1, ..., newcolN=seriesN)\n",
"```\n",
"returns a copy of `df` with the `N` additional columns labeled as `newcol1, ..., newcolN`. The new columns are aligned by their indexes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Example:** Given dataframe of random integers, with columns:\n",
"* `rand_big` containing integers 0-9 inclusive,\n",
"* `rand_small` containing integers 0-2 inclusive"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" rand_big | \n",
" rand_small | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 8 | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" 5 | \n",
" 1 | \n",
"
\n",
" \n",
" 3 | \n",
" 6 | \n",
" 1 | \n",
"
\n",
" \n",
" 4 | \n",
" 9 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" rand_big rand_small\n",
"0 8 1\n",
"1 0 0\n",
"2 5 1\n",
"3 6 1\n",
"4 9 1"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rand_df = pd.DataFrame({\n",
" 'rand_big': np.random.randint(10,size=5), \n",
" 'rand_small': np.random.randint(3, size=5)\n",
"})\n",
"rand_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To append boolean columns to `rand_df` that specifies if a given entry of `rand_big` is divisible by 3 and if a given entry of `rand_small` is even:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" rand_big | \n",
" rand_small | \n",
" big_is_div3 | \n",
" small_is_even | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 8 | \n",
" 1 | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 2 | \n",
" 5 | \n",
" 1 | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 3 | \n",
" 6 | \n",
" 1 | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 9 | \n",
" 1 | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" rand_big rand_small big_is_div3 small_is_even\n",
"0 8 1 False False\n",
"1 0 0 True True\n",
"2 5 1 False False\n",
"3 6 1 True False\n",
"4 9 1 True False"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rand_df.assign(\n",
" big_is_div3=(rand_df['rand_big'] % 3 == 0), \n",
" small_is_even=(rand_df['rand_small'] % 2 == 0)\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, the names of the columns can be passed in the keys of a keyword dictionary. This is useful when the new column names contain spaces or special characters:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" rand_big | \n",
" rand_small | \n",
" big_is_div3 | \n",
" small_is_even | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 8 | \n",
" 1 | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 2 | \n",
" 5 | \n",
" 1 | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 3 | \n",
" 6 | \n",
" 1 | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 9 | \n",
" 1 | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" rand_big rand_small big_is_div3 small_is_even\n",
"0 8 1 False False\n",
"1 0 0 True True\n",
"2 5 1 False False\n",
"3 6 1 True False\n",
"4 9 1 True False"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_cols = {\n",
" 'big_is_div3': rand_df['rand_big'] % 3 == 0,\n",
" 'small_is_even': rand_df['rand_small'] % 2 == 0\n",
"}\n",
"\n",
"rand_df.assign(**new_cols)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When building a new dataframe from derived columns as above, a common pattern is to assign new columns to an empty dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" big_is_div3 | \n",
" small_is_even | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 2 | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 3 | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" big_is_div3 small_is_even\n",
"0 False False\n",
"1 True True\n",
"2 False False\n",
"3 True False\n",
"4 True False"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame().assign(**new_cols)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Modifying an existing dataframe\n",
"\n",
"A Pandas dataframe is a columnar data store where each column is a Series keyed by its column name. Just as columns (and subtables) selected using `[]` and `loc`, they can be also be set/modified by reassigning the data associated to a given key. While this approach is the most common way to modify a dataframe, it usually isn't the best way to modify them; using `assign` is the better choice when possible.\n",
"\n",
"Given an input dataframe `df` and `N` Series `series1, ..., seriesN`:\n",
"```\n",
"df['newcol1'] = series1 \n",
"... \n",
"df['newcolN'] = seriesN\n",
"```\n",
"\n",
"modifies `df` by adding `N` additional columns labeled as `newcol1, ..., newcolN`. The new columns are aligned by their indexes.\n",
"\n",
"\n",
"*Remark:* Modifying a dataframe via column/index reassignment is the most time/space efficient procedure on a single processor. However it has many disadvantages worth understanding. Reassignment\n",
"* encourages procedural code that is hard to maintain and extend (unlike method chaining),\n",
"* often has side-effects on existing dataframes that cause hard-to-find bugs. This is especially true when working in Jupyter notebooks, as re-running previous code on a modified dataframe may lead to unintended output.\n",
"* obfuscates the data processing logic and makes it harder to translate the code into a distributed environment."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Example:** To append boolean columns to `rand_df` that specifies if a given entry of `rand_big` is divisible by 3 and if a given entry of `rand_small` is even:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" rand_big | \n",
" rand_small | \n",
" big_is_div3 | \n",
" small_is_even | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 8 | \n",
" 1 | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 2 | \n",
" 5 | \n",
" 1 | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 3 | \n",
" 6 | \n",
" 1 | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 9 | \n",
" 1 | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" rand_big rand_small big_is_div3 small_is_even\n",
"0 8 1 False False\n",
"1 0 0 True True\n",
"2 5 1 False False\n",
"3 6 1 True False\n",
"4 9 1 True False"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rand_df['big_is_div3'] = rand_df['rand_big'] % 3 == 0\n",
"rand_df['small_is_even'] = rand_df['rand_small'] % 2 == 0\n",
"\n",
"rand_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Modifying subtables with `loc`\n",
"\n",
"One can also (re-)assign columns using `loc`. Given an input dataframe `df` and `N` Series `series1, ..., seriesN`:\n",
"```\n",
"df.loc[:, 'newcol1'] = series1 \n",
"... \n",
"df.loc[:, 'newcolN'] = seriesN\n",
"```\n",
"\n",
"modifies `df` by adding `N` additional columns labeled as `newcol1, ..., newcolN`. The new columns are aligned by their indexes.\n",
"\n",
"**Example:** To append boolean columns to `rand_df` that specifies if a given entry of `rand_big` is divisible by 3 and if a given entry of `rand_small` is even:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" rand_big | \n",
" rand_small | \n",
" big_is_div3 | \n",
" small_is_even | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 8 | \n",
" 1 | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" True | \n",
" True | \n",
"
\n",
" \n",
" 2 | \n",
" 5 | \n",
" 1 | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 3 | \n",
" 6 | \n",
" 1 | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" 9 | \n",
" 1 | \n",
" True | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" rand_big rand_small big_is_div3 small_is_even\n",
"0 8 1 False False\n",
"1 0 0 True True\n",
"2 5 1 False False\n",
"3 6 1 True False\n",
"4 9 1 True False"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rand_df.loc[:, 'big_is_div3'] = rand_df['rand_big'] % 3 == 0\n",
"rand_df.loc[:, 'small_is_even'] = rand_df['rand_small'] % 2 == 0\n",
"\n",
"rand_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Remark:* Note, this code actually *reassigns* the two new columns, as we modifying the original dataframe in the previous example!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Modify tables with `loc` actually allows to modify subsets of rows as well. Generally, just as one can selecting general subtables using `loc` and boolean arrays, one can also reassign those selected values. For example, to add a column `small_number_type` that labels each value of `rand_small` as either `Even` or `Odd`:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" rand_big | \n",
" rand_small | \n",
" big_is_div3 | \n",
" small_is_even | \n",
" small_number_type | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 8 | \n",
" 1 | \n",
" False | \n",
" False | \n",
" Odd | \n",
"
\n",
" \n",
" 1 | \n",
" 0 | \n",
" 0 | \n",
" True | \n",
" True | \n",
" Even | \n",
"
\n",
" \n",
" 2 | \n",
" 5 | \n",
" 1 | \n",
" False | \n",
" False | \n",
" Odd | \n",
"
\n",
" \n",
" 3 | \n",
" 6 | \n",
" 1 | \n",
" True | \n",
" False | \n",
" Odd | \n",
"
\n",
" \n",
" 4 | \n",
" 9 | \n",
" 1 | \n",
" True | \n",
" False | \n",
" Odd | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" rand_big rand_small big_is_div3 small_is_even small_number_type\n",
"0 8 1 False False Odd\n",
"1 0 0 True True Even\n",
"2 5 1 False False Odd\n",
"3 6 1 True False Odd\n",
"4 9 1 True False Odd"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# copy rand_small data to a new column\n",
"rand_df.loc[:, 'small_number_type'] = rand_df['rand_small']\n",
"\n",
"# boolean arrays for selecting even/odd entries\n",
"evens_bool = rand_df['rand_small'] % 2 == 0\n",
"odds_bool = rand_df['rand_small'] % 2 == 1\n",
"\n",
"# reassign even/odd entries to the values 'Even'/'Odd'\n",
"rand_df.loc[evens_bool, 'small_number_type'] = 'Even'\n",
"rand_df.loc[odds_bool, 'small_number_type'] = 'Odd'\n",
"\n",
"rand_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Warning: chained assignment\n",
"\n",
"Modifying a dataframe through reassignment sometimes leads to unexpected results, especially when part of a long data cleaning process. The root of these ambiguities come from whether certain Pandas operations return copy or a reference to the underlying object being modified. This behavior is context dependent (e.g. it may depend on the data-types present in the dataframe.\n",
"\n",
"Unintentionally setting new values on a copy dataframe, instead of a reference, leads to unexpected results. This most commonly happens when performing *chained assignment* -- reassigning values to a dataframe created from multiple indexing operations.\n",
"\n",
"Pandas will return a `SettingWithCopyWarning` when attempting to reassign the values of a dataframe that is already a copy of a dataframe. Such code is ambiguous: is the reassignment intended for solely the copy or the original dataframe as well? Regardless, such code is unclear and `assign` should be used if possible."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Example:** The different chained assignments of the table `numbers` illustrate the dangers of chained assignment."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" type | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" even | \n",
" 2 | \n",
" 4 | \n",
" 6 | \n",
"
\n",
" \n",
" 1 | \n",
" odd | \n",
" 1 | \n",
" 3 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" type A B C\n",
"0 even 2 4 6\n",
"1 odd 1 3 5"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numbers = pd.DataFrame([['even', 2, 4, 6], ['odd', 1, 3, 5]], columns=['type', 'A', 'B', 'C'])\n",
"numbers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first `.loc` returns a copy; the reassignment is performed on a copy that is not saved to a variable. This results in the value remaining unchanged."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" type | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" even | \n",
" 2 | \n",
" 4 | \n",
" 6 | \n",
"
\n",
" \n",
" 1 | \n",
" odd | \n",
" 1 | \n",
" 3 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" type A B C\n",
"0 even 2 4 6\n",
"1 odd 1 3 5"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numbers_modify1 = numbers.copy()\n",
"\n",
"numbers_modify1.loc[1].loc['B'] = 1\n",
"numbers_modify1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Removing the chained assignment returns the expected result: the 3 is changed to a 1."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" type | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" even | \n",
" 2 | \n",
" 4 | \n",
" 6 | \n",
"
\n",
" \n",
" 1 | \n",
" odd | \n",
" 1 | \n",
" 1 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" type A B C\n",
"0 even 2 4 6\n",
"1 odd 1 1 5"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numbers_modify3 = numbers.copy()\n",
"\n",
"numbers_modify3.loc[1, 'B'] = 1\n",
"numbers_modify3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For performance reasons, Pandas often returns a reference instead of a copy when the underlying DataFrame has homogeneous data-type. This leads to different results from the previous illustration! After dropping the `object` type column in `numbers`, chained assignment manages to give to the desired result."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 2 | \n",
" 4 | \n",
" 6 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 2 4 6\n",
"1 1 1 5"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"without_type = numbers.drop('type', axis=1) # drop returns a copy!\n",
"without_type.loc[1].loc['B'] = 1\n",
"without_type"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Example:** Chained assignment usually appears when reassignment to on a subtable created through more complicated processing. Selecting a subtable creates a copy; modifying that subtable doesn't affect the original table."
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" type | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" even | \n",
" 2 | \n",
" 4 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" type A B C\n",
"0 even 2 4 6"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"firstrow = numbers.loc[numbers['C'] == 6]\n",
"firstrow"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" type | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" even | \n",
" 100 | \n",
" 4 | \n",
" 6 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" type A B C\n",
"0 even 100 4 6"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"firstrow.loc[0, 'A'] = 100\n",
"firstrow"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" type | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" even | \n",
" 2 | \n",
" 4 | \n",
" 6 | \n",
"
\n",
" \n",
" 1 | \n",
" odd | \n",
" 1 | \n",
" 3 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" type A B C\n",
"0 even 2 4 6\n",
"1 odd 1 3 5"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numbers"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}