{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import pandas as pd\n", "import numpy as np\n", "\n", "pd.set_option('display.max_rows', 7)\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Modifying DataFrames\n", "\n", "Methods for modifying existing Pandas dataframes fall into two categories:\n", "* chaining together modified copies of an existing dataframe, and\n", "* modifying the data contained in an existing dataframe in-place.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mutating DataFrames using `assign`\n", "\n", "The `assign` method returns a copy of a dataframe, augmented by the column(s) specified in the method. `assign` fits into the functional 'method chaining' paradigm of table manipulation: easy to understand, free of side-effects, and easily generalized to distributed data processing libraries. However, when used in Pandas, `assign` returns a copy of the dataframe, making it a poor choice for manipulating a large dataset on a single processor.\n", "\n", "Given an input dataframe `df` and `N` Series `series1, ..., seriesN`:\n", "```\n", "df.assign(newcol1=series1, ..., newcolN=seriesN)\n", "```\n", "returns a copy of `df` with the `N` additional columns labeled as `newcol1, ..., newcolN`. The new columns are aligned by their indexes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example:** Given dataframe of random integers, with columns:\n", "* `rand_big` containing integers 0-9 inclusive,\n", "* `rand_small` containing integers 0-2 inclusive" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rand_bigrand_small
081
100
251
361
491
\n", "
" ], "text/plain": [ " rand_big rand_small\n", "0 8 1\n", "1 0 0\n", "2 5 1\n", "3 6 1\n", "4 9 1" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rand_df = pd.DataFrame({\n", " 'rand_big': np.random.randint(10,size=5), \n", " 'rand_small': np.random.randint(3, size=5)\n", "})\n", "rand_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To append boolean columns to `rand_df` that specifies if a given entry of `rand_big` is divisible by 3 and if a given entry of `rand_small` is even:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rand_bigrand_smallbig_is_div3small_is_even
081FalseFalse
100TrueTrue
251FalseFalse
361TrueFalse
491TrueFalse
\n", "
" ], "text/plain": [ " rand_big rand_small big_is_div3 small_is_even\n", "0 8 1 False False\n", "1 0 0 True True\n", "2 5 1 False False\n", "3 6 1 True False\n", "4 9 1 True False" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rand_df.assign(\n", " big_is_div3=(rand_df['rand_big'] % 3 == 0), \n", " small_is_even=(rand_df['rand_small'] % 2 == 0)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, the names of the columns can be passed in the keys of a keyword dictionary. This is useful when the new column names contain spaces or special characters:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rand_bigrand_smallbig_is_div3small_is_even
081FalseFalse
100TrueTrue
251FalseFalse
361TrueFalse
491TrueFalse
\n", "
" ], "text/plain": [ " rand_big rand_small big_is_div3 small_is_even\n", "0 8 1 False False\n", "1 0 0 True True\n", "2 5 1 False False\n", "3 6 1 True False\n", "4 9 1 True False" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_cols = {\n", " 'big_is_div3': rand_df['rand_big'] % 3 == 0,\n", " 'small_is_even': rand_df['rand_small'] % 2 == 0\n", "}\n", "\n", "rand_df.assign(**new_cols)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When building a new dataframe from derived columns as above, a common pattern is to assign new columns to an empty dataframe:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
big_is_div3small_is_even
0FalseFalse
1TrueTrue
2FalseFalse
3TrueFalse
4TrueFalse
\n", "
" ], "text/plain": [ " big_is_div3 small_is_even\n", "0 False False\n", "1 True True\n", "2 False False\n", "3 True False\n", "4 True False" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame().assign(**new_cols)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Modifying an existing dataframe\n", "\n", "A Pandas dataframe is a columnar data store where each column is a Series keyed by its column name. Just as columns (and subtables) selected using `[]` and `loc`, they can be also be set/modified by reassigning the data associated to a given key. While this approach is the most common way to modify a dataframe, it usually isn't the best way to modify them; using `assign` is the better choice when possible.\n", "\n", "Given an input dataframe `df` and `N` Series `series1, ..., seriesN`:\n", "```\n", "df['newcol1'] = series1 \n", "... \n", "df['newcolN'] = seriesN\n", "```\n", "\n", "modifies `df` by adding `N` additional columns labeled as `newcol1, ..., newcolN`. The new columns are aligned by their indexes.\n", "\n", "\n", "*Remark:* Modifying a dataframe via column/index reassignment is the most time/space efficient procedure on a single processor. However it has many disadvantages worth understanding. Reassignment\n", "* encourages procedural code that is hard to maintain and extend (unlike method chaining),\n", "* often has side-effects on existing dataframes that cause hard-to-find bugs. This is especially true when working in Jupyter notebooks, as re-running previous code on a modified dataframe may lead to unintended output.\n", "* obfuscates the data processing logic and makes it harder to translate the code into a distributed environment." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example:** To append boolean columns to `rand_df` that specifies if a given entry of `rand_big` is divisible by 3 and if a given entry of `rand_small` is even:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rand_bigrand_smallbig_is_div3small_is_even
081FalseFalse
100TrueTrue
251FalseFalse
361TrueFalse
491TrueFalse
\n", "
" ], "text/plain": [ " rand_big rand_small big_is_div3 small_is_even\n", "0 8 1 False False\n", "1 0 0 True True\n", "2 5 1 False False\n", "3 6 1 True False\n", "4 9 1 True False" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rand_df['big_is_div3'] = rand_df['rand_big'] % 3 == 0\n", "rand_df['small_is_even'] = rand_df['rand_small'] % 2 == 0\n", "\n", "rand_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Modifying subtables with `loc`\n", "\n", "One can also (re-)assign columns using `loc`. Given an input dataframe `df` and `N` Series `series1, ..., seriesN`:\n", "```\n", "df.loc[:, 'newcol1'] = series1 \n", "... \n", "df.loc[:, 'newcolN'] = seriesN\n", "```\n", "\n", "modifies `df` by adding `N` additional columns labeled as `newcol1, ..., newcolN`. The new columns are aligned by their indexes.\n", "\n", "**Example:** To append boolean columns to `rand_df` that specifies if a given entry of `rand_big` is divisible by 3 and if a given entry of `rand_small` is even:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rand_bigrand_smallbig_is_div3small_is_even
081FalseFalse
100TrueTrue
251FalseFalse
361TrueFalse
491TrueFalse
\n", "
" ], "text/plain": [ " rand_big rand_small big_is_div3 small_is_even\n", "0 8 1 False False\n", "1 0 0 True True\n", "2 5 1 False False\n", "3 6 1 True False\n", "4 9 1 True False" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rand_df.loc[:, 'big_is_div3'] = rand_df['rand_big'] % 3 == 0\n", "rand_df.loc[:, 'small_is_even'] = rand_df['rand_small'] % 2 == 0\n", "\n", "rand_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Remark:* Note, this code actually *reassigns* the two new columns, as we modifying the original dataframe in the previous example!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Modify tables with `loc` actually allows to modify subsets of rows as well. Generally, just as one can selecting general subtables using `loc` and boolean arrays, one can also reassign those selected values. For example, to add a column `small_number_type` that labels each value of `rand_small` as either `Even` or `Odd`:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
rand_bigrand_smallbig_is_div3small_is_evensmall_number_type
081FalseFalseOdd
100TrueTrueEven
251FalseFalseOdd
361TrueFalseOdd
491TrueFalseOdd
\n", "
" ], "text/plain": [ " rand_big rand_small big_is_div3 small_is_even small_number_type\n", "0 8 1 False False Odd\n", "1 0 0 True True Even\n", "2 5 1 False False Odd\n", "3 6 1 True False Odd\n", "4 9 1 True False Odd" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# copy rand_small data to a new column\n", "rand_df.loc[:, 'small_number_type'] = rand_df['rand_small']\n", "\n", "# boolean arrays for selecting even/odd entries\n", "evens_bool = rand_df['rand_small'] % 2 == 0\n", "odds_bool = rand_df['rand_small'] % 2 == 1\n", "\n", "# reassign even/odd entries to the values 'Even'/'Odd'\n", "rand_df.loc[evens_bool, 'small_number_type'] = 'Even'\n", "rand_df.loc[odds_bool, 'small_number_type'] = 'Odd'\n", "\n", "rand_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Warning: chained assignment\n", "\n", "Modifying a dataframe through reassignment sometimes leads to unexpected results, especially when part of a long data cleaning process. The root of these ambiguities come from whether certain Pandas operations return copy or a reference to the underlying object being modified. This behavior is context dependent (e.g. it may depend on the data-types present in the dataframe.\n", "\n", "Unintentionally setting new values on a copy dataframe, instead of a reference, leads to unexpected results. This most commonly happens when performing *chained assignment* -- reassigning values to a dataframe created from multiple indexing operations.\n", "\n", "Pandas will return a `SettingWithCopyWarning` when attempting to reassign the values of a dataframe that is already a copy of a dataframe. Such code is ambiguous: is the reassignment intended for solely the copy or the original dataframe as well? Regardless, such code is unclear and `assign` should be used if possible." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example:** The different chained assignments of the table `numbers` illustrate the dangers of chained assignment." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typeABC
0even246
1odd135
\n", "
" ], "text/plain": [ " type A B C\n", "0 even 2 4 6\n", "1 odd 1 3 5" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numbers = pd.DataFrame([['even', 2, 4, 6], ['odd', 1, 3, 5]], columns=['type', 'A', 'B', 'C'])\n", "numbers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first `.loc` returns a copy; the reassignment is performed on a copy that is not saved to a variable. This results in the value remaining unchanged." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typeABC
0even246
1odd135
\n", "
" ], "text/plain": [ " type A B C\n", "0 even 2 4 6\n", "1 odd 1 3 5" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numbers_modify1 = numbers.copy()\n", "\n", "numbers_modify1.loc[1].loc['B'] = 1\n", "numbers_modify1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Removing the chained assignment returns the expected result: the 3 is changed to a 1." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typeABC
0even246
1odd115
\n", "
" ], "text/plain": [ " type A B C\n", "0 even 2 4 6\n", "1 odd 1 1 5" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numbers_modify3 = numbers.copy()\n", "\n", "numbers_modify3.loc[1, 'B'] = 1\n", "numbers_modify3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For performance reasons, Pandas often returns a reference instead of a copy when the underlying DataFrame has homogeneous data-type. This leads to different results from the previous illustration! After dropping the `object` type column in `numbers`, chained assignment manages to give to the desired result." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ABC
0246
1115
\n", "
" ], "text/plain": [ " A B C\n", "0 2 4 6\n", "1 1 1 5" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "without_type = numbers.drop('type', axis=1) # drop returns a copy!\n", "without_type.loc[1].loc['B'] = 1\n", "without_type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example:** Chained assignment usually appears when reassignment to on a subtable created through more complicated processing. Selecting a subtable creates a copy; modifying that subtable doesn't affect the original table." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typeABC
0even246
\n", "
" ], "text/plain": [ " type A B C\n", "0 even 2 4 6" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "firstrow = numbers.loc[numbers['C'] == 6]\n", "firstrow" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typeABC
0even10046
\n", "
" ], "text/plain": [ " type A B C\n", "0 even 100 4 6" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "firstrow.loc[0, 'A'] = 100\n", "firstrow" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typeABC
0even246
1odd135
\n", "
" ], "text/plain": [ " type A B C\n", "0 even 2 4 6\n", "1 odd 1 3 5" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numbers" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }