{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Selecting Data\n", "---\n", "\n", "A given table usually represents a set of measurements for a given population. Oftentimes, one's interest lies in either a specific subpopulation of interest (i.e. rows) or a specific subset of measurements (i.e. columns).\n", "\n", "For example, in the table below, one might want to \n", "* restrict attention to those individuals in their thirties,\n", "* concern oneself with the name and age of the individuals.\n", "\n", "Selecting rows and columns based on this criteria results in a smaller table:\n", "\n", "![selecting a subtable](imgs/selecting-subtables.png)\n", "\n", "The [previous chapter](02/tabular-data.html#selecting-rows-with-loc-and-columns-with-) covered selection of single rows and columns, using `.loc[]` and `[]` respectively. This section covers general approaches to selecting a subtables of a given table. These approaches include:\n", "* selecting explicit subsets of observations (rows) and/or attributes (columns),\n", "* selecting subsets of observations and attributes based on conditions using boolean arrays.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting explicit subsets of rows and columns\n", "\n", "### Row selection\n", "\n", "The dataframe `currencies` contains different currency names, their exchange rate against the US Dollar, and the continent of the given country, indexed by the name of the country manufacturing the currency." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
currencyexchangecontinent
USAusd1.000North America
JPNyen105.940Asia
EUeuro0.897Europe
MEXpeso19.640North America
CHNrenminbi7.020Asia
\n", "
" ], "text/plain": [ " currency exchange continent\n", "USA usd 1.000 North America\n", "JPN yen 105.940 Asia\n", "EU euro 0.897 Europe\n", "MEX peso 19.640 North America\n", "CHN renminbi 7.020 Asia" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "currencies = pd.DataFrame({\n", " 'currency': ['usd', 'yen', 'euro', 'peso', 'renminbi'],\n", " 'exchange': [1, 105.94, 0.897, 19.64, 7.02],\n", " 'continent': ['North America', 'Asia', 'Europe', 'North America', 'Asia']\n", "}, index=['USA', 'JPN', 'EU', 'MEX', 'CHN'])\n", "\n", "currencies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that `currencies.loc['MEX']` selects the attributes corresponding to `MEX`; this row is a *one dimensional* Series object. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "currency peso\n", "exchange 19.64\n", "continent North America\n", "Name: MEX, dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "currencies.loc['MEX']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Selecting a subset of rows results in a potentially *smaller* table. However, such a table is still a *two dimensional* dataframe. To select a subset rows, simply pass a list to `loc`. For example, to create a table from `currencies` consisting of the rows indexed by `JAPAN` and `CHN`:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
currencyexchangecontinent
JPNyen105.94Asia
CHNrenminbi7.02Asia
\n", "
" ], "text/plain": [ " currency exchange continent\n", "JPN yen 105.94 Asia\n", "CHN renminbi 7.02 Asia" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "currencies.loc[['JPN', 'CHN']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Selecting explicit subsets of rows using `loc`**: Given a dataframe `df` and a subset `idx_list` of the index `df.index`, the dataframe `df.loc[idx_list]` consists of the rows of `df` with index given by `idx_list`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example:** If the index list consists of one a single index, the resulting object is still a two dimensional dataframe, consisting of a single row." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
currencyexchangecontinent
USAusd1.0North America
\n", "
" ], "text/plain": [ " currency exchange continent\n", "USA usd 1.0 North America" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "idx_list = ['USA']\n", "currencies.loc[idx_list]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Column Selection\n", "\n", "Recall that `currencies['exchange']` selects exchange rate column for each country in the table; this column is a one dimensional object." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "USA 1.000\n", "JPN 105.940\n", "EU 0.897\n", "MEX 19.640\n", "CHN 7.020\n", "Name: exchange, dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "currencies['exchange']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to row selection, passing a list of column names produces a dataframe with the columns given in the list. For example, to select only the 'currency' and 'exchange' columns:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
currencyexchange
USAusd1.000
JPNyen105.940
EUeuro0.897
MEXpeso19.640
CHNrenminbi7.020
\n", "
" ], "text/plain": [ " currency exchange\n", "USA usd 1.000\n", "JPN yen 105.940\n", "EU euro 0.897\n", "MEX peso 19.640\n", "CHN renminbi 7.020" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "currencies[['currency', 'exchange']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Selecting explicit subsets of columns using `[]`**: Given a dataframe `df` and a subset `cols` of the columns `df.columns`, the dataframe `df[cols]` consists of the rows of `df` with columns given by the columns in `cols`.\n", "\n", "**Example:** This method of column selection offers a convenient way of explicitly reordering the columns of a table: pass the full column list in the desired order." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
continentcurrencyexchange
USANorth Americausd1.000
JPNAsiayen105.940
EUEuropeeuro0.897
MEXNorth Americapeso19.640
CHNAsiarenminbi7.020
\n", "
" ], "text/plain": [ " continent currency exchange\n", "USA North America usd 1.000\n", "JPN Asia yen 105.940\n", "EU Europe euro 0.897\n", "MEX North America peso 19.640\n", "CHN Asia renminbi 7.020" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cols = ['continent', 'currency', 'exchange']\n", "currencies[cols]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selection of subtables via conditions\n", "\n", "Most often, selection of observations and attributes occur via applying some relevant criteria. For example, in a table of survey responses, one may only want to consider:\n", "* responses from respondents of a certain age (row selection) \n", "* answers to questions that have a 100% response rate (column selection).\n", "\n", "Such selections occur in two steps:\n", "1. create a boolean index using a vector comparison that captures the selection logic,\n", "2. pass the boolean index to the Pandas row/column selector. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Boolean indexing\n", "\n", "Boolean indexes are boolean arrays that represent whether or not a condition is met for a given position in an index. Such arrays are created using logical operators on array objects. For example, a boolean index for those countries whose exchange rate is create than one is given by:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "USA False\n", "JPN True\n", "EU False\n", "MEX True\n", "CHN True\n", "Name: exchange, dtype: bool" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "currencies['exchange'] > 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A boolean index that reads `True` if a country is either in Asia, or has an exchange rate less than 1, is given by:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "USA False\n", "JPN True\n", "EU True\n", "MEX False\n", "CHN True\n", "dtype: bool" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(currencies['continent'] == 'Asia') | (currencies['exchange'] < 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting rows using boolean indexes\n", "\n", "Rows of a Pandas dataframe can be selected by passing a boolean index to `loc`. For example, one can create a dataframe consisting of the rows containing JPN and MEX using a boolean array with `True` only in the second and fourth positions:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
currencyexchangecontinent
JPNyen105.94Asia
MEXpeso19.64North America
\n", "
" ], "text/plain": [ " currency exchange continent\n", "JPN yen 105.94 Asia\n", "MEX peso 19.64 North America" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bool_arr = [False, True, False, True, False]\n", "currencies.loc[bool_arr]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The subtable consisting of countries either in Asia or with an exchange rate less than 1 is obtained via:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
currencyexchangecontinent
JPNyen105.940Asia
EUeuro0.897Europe
CHNrenminbi7.020Asia
\n", "
" ], "text/plain": [ " currency exchange continent\n", "JPN yen 105.940 Asia\n", "EU euro 0.897 Europe\n", "CHN renminbi 7.020 Asia" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "asia_or_exch_less_1 = (currencies['continent'] == 'Asia') | (currencies['exchange'] < 1)\n", "currencies.loc[asia_or_exch_less_1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Selecting rows using boolean indexes:** Suppose `df` is a dataframe and `bool_arr` is a boolean array of the same length of `df`. Then `df.loc[bool_arr]` is a dataframe whose rows are the rows of `df` for which the corresponding position in `bool_arr` is `True`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Selecting rows using functions\n", "\n", "Rows of a Pandas DataFrame can also be selected by passing a *function* to loc:\n", "* the function takes in a DataFrame and returns a boolean array;\n", "* this boolean array (applied to the DataFrame at hand) is then passed to loc to select the rows as outlined above.\n", "\n", "**Example:** To select the rows in `currencies` whose currency begins with the letter 'o':" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
currencyexchangecontinent
EUeuro0.897Europe
MEXpeso19.640North America
\n", "
" ], "text/plain": [ " currency exchange continent\n", "EU euro 0.897 Europe\n", "MEX peso 19.640 North America" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def ends_in_o(df):\n", " '''returns a boolean array representing\n", " whether each row in the currency \n", " column of `df` ends in the letter o.'''\n", " return df['currency'].str.endswith('o')\n", "\n", "currencies.loc[ends_in_o]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**selecting rows using function:** Suppose `df` is a DataFrame and `f` is a function that takes in a DataFrame and returns a boolean array. Then `df.loc[f]` returns the same DataFrame as `df.loc[f(df)]`.\n", "\n", "This technique is useful when using method chaining, as the function generalizes the selection logic without referencing a specific DataFrame.\n", "\n", "**Example:** Without using `sort_values`/`drop_duplicates`, and using method-chaining, return a DataFrame containing the country (or countries) with the largest exchange rate among the currencies that end with the letter 'o'.\n", "\n", "This requires two steps:\n", "1. select the countries whose currency begins with the letter 'o',\n", "1. select the countries in the DataFrame from step 1 whose exchange rate is the largest.\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "def equal_to_max(df):\n", " '''returns a boolean array denoting if the exchange\n", " rate of a row of `df` is equal to the max exchange rate.'''\n", " return df['currency'] == df['currency'].max()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
currencyexchangecontinent
MEXpeso19.64North America
\n", "
" ], "text/plain": [ " currency exchange continent\n", "MEX peso 19.64 North America" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " currencies\n", " .loc[ends_in_o]\n", " .loc[equal_to_max]\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Remark:* This cannot be done by passing a boolean index directly, as the logic in step 2 refers to the output of step 1, which doesn't have a name! While this example is contrived, this constraint commonly appears when adding and modifying columns of a DataFrame using method chaining." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting subtables using `loc`\n", "\n", "The `loc` selector allows simultaneous selection of rows and columns via matrix notation. That is, given a dataframe `df`, a list of indexes `idx`, and column labels `cols`, the expression `df.loc[idx, cols]` evaluates to the dataframe with rows corresponding to the index `idx` and columns corresponding to the columns in `cols`.\n", "\n", "For example, the currency and exchange rate for MEX and JPN is given by:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
currencyexchange
MEXpeso19.64
JPNyen105.94
\n", "
" ], "text/plain": [ " currency exchange\n", "MEX peso 19.64\n", "JPN yen 105.94" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "countries = ['MEX', 'JPN']\n", "attributes = ['currency', 'exchange']\n", "currencies.loc[countries, attributes]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly `loc` can also take *pairs* of boolean arrays corresponding rows and columns. Below selects the first and last row and the middle column using boolean arrays:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
exchange
USA1.00
CHN7.02
\n", "
" ], "text/plain": [ " exchange\n", "USA 1.00\n", "CHN 7.02" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "currencies.loc[\n", " [True, False, False, False, True], \n", " [False, True, False]\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The slicing operator `:` selects all rows and/or columns when passed into `loc`:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
currencyexchange
USAusd1.000
JPNyen105.940
EUeuro0.897
MEXpeso19.640
CHNrenminbi7.020
\n", "
" ], "text/plain": [ " currency exchange\n", "USA usd 1.000\n", "JPN yen 105.940\n", "EU euro 0.897\n", "MEX peso 19.640\n", "CHN renminbi 7.020" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "currencies.loc[:, attributes]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Beware: strange behaviors with row and column selection\n", "* What if your boolean arrays have incorrect length?\n", "* What if your indexes/columns are of boolean type?\n", "* `[]` has the same problems!\n", "* Duplicate column names / duplicate selection?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }