{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove_cell"
]
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Selecting Data\n",
"---\n",
"\n",
"A given table usually represents a set of measurements for a given population. Oftentimes, one's interest lies in either a specific subpopulation of interest (i.e. rows) or a specific subset of measurements (i.e. columns).\n",
"\n",
"For example, in the table below, one might want to \n",
"* restrict attention to those individuals in their thirties,\n",
"* concern oneself with the name and age of the individuals.\n",
"\n",
"Selecting rows and columns based on this criteria results in a smaller table:\n",
"\n",
"![selecting a subtable](imgs/selecting-subtables.png)\n",
"\n",
"The [previous chapter](02/tabular-data.html#selecting-rows-with-loc-and-columns-with-) covered selection of single rows and columns, using `.loc[]` and `[]` respectively. This section covers general approaches to selecting a subtables of a given table. These approaches include:\n",
"* selecting explicit subsets of observations (rows) and/or attributes (columns),\n",
"* selecting subsets of observations and attributes based on conditions using boolean arrays.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Selecting explicit subsets of rows and columns\n",
"\n",
"### Row selection\n",
"\n",
"The dataframe `currencies` contains different currency names, their exchange rate against the US Dollar, and the continent of the given country, indexed by the name of the country manufacturing the currency."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" currency | \n",
" exchange | \n",
" continent | \n",
"
\n",
" \n",
" \n",
" \n",
" USA | \n",
" usd | \n",
" 1.000 | \n",
" North America | \n",
"
\n",
" \n",
" JPN | \n",
" yen | \n",
" 105.940 | \n",
" Asia | \n",
"
\n",
" \n",
" EU | \n",
" euro | \n",
" 0.897 | \n",
" Europe | \n",
"
\n",
" \n",
" MEX | \n",
" peso | \n",
" 19.640 | \n",
" North America | \n",
"
\n",
" \n",
" CHN | \n",
" renminbi | \n",
" 7.020 | \n",
" Asia | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" currency exchange continent\n",
"USA usd 1.000 North America\n",
"JPN yen 105.940 Asia\n",
"EU euro 0.897 Europe\n",
"MEX peso 19.640 North America\n",
"CHN renminbi 7.020 Asia"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"currencies = pd.DataFrame({\n",
" 'currency': ['usd', 'yen', 'euro', 'peso', 'renminbi'],\n",
" 'exchange': [1, 105.94, 0.897, 19.64, 7.02],\n",
" 'continent': ['North America', 'Asia', 'Europe', 'North America', 'Asia']\n",
"}, index=['USA', 'JPN', 'EU', 'MEX', 'CHN'])\n",
"\n",
"currencies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recall that `currencies.loc['MEX']` selects the attributes corresponding to `MEX`; this row is a *one dimensional* Series object. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"currency peso\n",
"exchange 19.64\n",
"continent North America\n",
"Name: MEX, dtype: object"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"currencies.loc['MEX']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selecting a subset of rows results in a potentially *smaller* table. However, such a table is still a *two dimensional* dataframe. To select a subset rows, simply pass a list to `loc`. For example, to create a table from `currencies` consisting of the rows indexed by `JAPAN` and `CHN`:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" currency | \n",
" exchange | \n",
" continent | \n",
"
\n",
" \n",
" \n",
" \n",
" JPN | \n",
" yen | \n",
" 105.94 | \n",
" Asia | \n",
"
\n",
" \n",
" CHN | \n",
" renminbi | \n",
" 7.02 | \n",
" Asia | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" currency exchange continent\n",
"JPN yen 105.94 Asia\n",
"CHN renminbi 7.02 Asia"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"currencies.loc[['JPN', 'CHN']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Selecting explicit subsets of rows using `loc`**: Given a dataframe `df` and a subset `idx_list` of the index `df.index`, the dataframe `df.loc[idx_list]` consists of the rows of `df` with index given by `idx_list`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Example:** If the index list consists of one a single index, the resulting object is still a two dimensional dataframe, consisting of a single row."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" currency | \n",
" exchange | \n",
" continent | \n",
"
\n",
" \n",
" \n",
" \n",
" USA | \n",
" usd | \n",
" 1.0 | \n",
" North America | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" currency exchange continent\n",
"USA usd 1.0 North America"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"idx_list = ['USA']\n",
"currencies.loc[idx_list]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Column Selection\n",
"\n",
"Recall that `currencies['exchange']` selects exchange rate column for each country in the table; this column is a one dimensional object."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"USA 1.000\n",
"JPN 105.940\n",
"EU 0.897\n",
"MEX 19.640\n",
"CHN 7.020\n",
"Name: exchange, dtype: float64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"currencies['exchange']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similar to row selection, passing a list of column names produces a dataframe with the columns given in the list. For example, to select only the 'currency' and 'exchange' columns:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" currency | \n",
" exchange | \n",
"
\n",
" \n",
" \n",
" \n",
" USA | \n",
" usd | \n",
" 1.000 | \n",
"
\n",
" \n",
" JPN | \n",
" yen | \n",
" 105.940 | \n",
"
\n",
" \n",
" EU | \n",
" euro | \n",
" 0.897 | \n",
"
\n",
" \n",
" MEX | \n",
" peso | \n",
" 19.640 | \n",
"
\n",
" \n",
" CHN | \n",
" renminbi | \n",
" 7.020 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" currency exchange\n",
"USA usd 1.000\n",
"JPN yen 105.940\n",
"EU euro 0.897\n",
"MEX peso 19.640\n",
"CHN renminbi 7.020"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"currencies[['currency', 'exchange']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Selecting explicit subsets of columns using `[]`**: Given a dataframe `df` and a subset `cols` of the columns `df.columns`, the dataframe `df[cols]` consists of the rows of `df` with columns given by the columns in `cols`.\n",
"\n",
"**Example:** This method of column selection offers a convenient way of explicitly reordering the columns of a table: pass the full column list in the desired order."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" continent | \n",
" currency | \n",
" exchange | \n",
"
\n",
" \n",
" \n",
" \n",
" USA | \n",
" North America | \n",
" usd | \n",
" 1.000 | \n",
"
\n",
" \n",
" JPN | \n",
" Asia | \n",
" yen | \n",
" 105.940 | \n",
"
\n",
" \n",
" EU | \n",
" Europe | \n",
" euro | \n",
" 0.897 | \n",
"
\n",
" \n",
" MEX | \n",
" North America | \n",
" peso | \n",
" 19.640 | \n",
"
\n",
" \n",
" CHN | \n",
" Asia | \n",
" renminbi | \n",
" 7.020 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" continent currency exchange\n",
"USA North America usd 1.000\n",
"JPN Asia yen 105.940\n",
"EU Europe euro 0.897\n",
"MEX North America peso 19.640\n",
"CHN Asia renminbi 7.020"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cols = ['continent', 'currency', 'exchange']\n",
"currencies[cols]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Selection of subtables via conditions\n",
"\n",
"Most often, selection of observations and attributes occur via applying some relevant criteria. For example, in a table of survey responses, one may only want to consider:\n",
"* responses from respondents of a certain age (row selection) \n",
"* answers to questions that have a 100% response rate (column selection).\n",
"\n",
"Such selections occur in two steps:\n",
"1. create a boolean index using a vector comparison that captures the selection logic,\n",
"2. pass the boolean index to the Pandas row/column selector. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Boolean indexing\n",
"\n",
"Boolean indexes are boolean arrays that represent whether or not a condition is met for a given position in an index. Such arrays are created using logical operators on array objects. For example, a boolean index for those countries whose exchange rate is create than one is given by:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"USA False\n",
"JPN True\n",
"EU False\n",
"MEX True\n",
"CHN True\n",
"Name: exchange, dtype: bool"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"currencies['exchange'] > 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A boolean index that reads `True` if a country is either in Asia, or has an exchange rate less than 1, is given by:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"USA False\n",
"JPN True\n",
"EU True\n",
"MEX False\n",
"CHN True\n",
"dtype: bool"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(currencies['continent'] == 'Asia') | (currencies['exchange'] < 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Selecting rows using boolean indexes\n",
"\n",
"Rows of a Pandas dataframe can be selected by passing a boolean index to `loc`. For example, one can create a dataframe consisting of the rows containing JPN and MEX using a boolean array with `True` only in the second and fourth positions:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" currency | \n",
" exchange | \n",
" continent | \n",
"
\n",
" \n",
" \n",
" \n",
" JPN | \n",
" yen | \n",
" 105.94 | \n",
" Asia | \n",
"
\n",
" \n",
" MEX | \n",
" peso | \n",
" 19.64 | \n",
" North America | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" currency exchange continent\n",
"JPN yen 105.94 Asia\n",
"MEX peso 19.64 North America"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"bool_arr = [False, True, False, True, False]\n",
"currencies.loc[bool_arr]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The subtable consisting of countries either in Asia or with an exchange rate less than 1 is obtained via:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" currency | \n",
" exchange | \n",
" continent | \n",
"
\n",
" \n",
" \n",
" \n",
" JPN | \n",
" yen | \n",
" 105.940 | \n",
" Asia | \n",
"
\n",
" \n",
" EU | \n",
" euro | \n",
" 0.897 | \n",
" Europe | \n",
"
\n",
" \n",
" CHN | \n",
" renminbi | \n",
" 7.020 | \n",
" Asia | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" currency exchange continent\n",
"JPN yen 105.940 Asia\n",
"EU euro 0.897 Europe\n",
"CHN renminbi 7.020 Asia"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"asia_or_exch_less_1 = (currencies['continent'] == 'Asia') | (currencies['exchange'] < 1)\n",
"currencies.loc[asia_or_exch_less_1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Selecting rows using boolean indexes:** Suppose `df` is a dataframe and `bool_arr` is a boolean array of the same length of `df`. Then `df.loc[bool_arr]` is a dataframe whose rows are the rows of `df` for which the corresponding position in `bool_arr` is `True`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Selecting rows using functions\n",
"\n",
"Rows of a Pandas DataFrame can also be selected by passing a *function* to loc:\n",
"* the function takes in a DataFrame and returns a boolean array;\n",
"* this boolean array (applied to the DataFrame at hand) is then passed to loc to select the rows as outlined above.\n",
"\n",
"**Example:** To select the rows in `currencies` whose currency begins with the letter 'o':"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" currency | \n",
" exchange | \n",
" continent | \n",
"
\n",
" \n",
" \n",
" \n",
" EU | \n",
" euro | \n",
" 0.897 | \n",
" Europe | \n",
"
\n",
" \n",
" MEX | \n",
" peso | \n",
" 19.640 | \n",
" North America | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" currency exchange continent\n",
"EU euro 0.897 Europe\n",
"MEX peso 19.640 North America"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def ends_in_o(df):\n",
" '''returns a boolean array representing\n",
" whether each row in the currency \n",
" column of `df` ends in the letter o.'''\n",
" return df['currency'].str.endswith('o')\n",
"\n",
"currencies.loc[ends_in_o]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**selecting rows using function:** Suppose `df` is a DataFrame and `f` is a function that takes in a DataFrame and returns a boolean array. Then `df.loc[f]` returns the same DataFrame as `df.loc[f(df)]`.\n",
"\n",
"This technique is useful when using method chaining, as the function generalizes the selection logic without referencing a specific DataFrame.\n",
"\n",
"**Example:** Without using `sort_values`/`drop_duplicates`, and using method-chaining, return a DataFrame containing the country (or countries) with the largest exchange rate among the currencies that end with the letter 'o'.\n",
"\n",
"This requires two steps:\n",
"1. select the countries whose currency begins with the letter 'o',\n",
"1. select the countries in the DataFrame from step 1 whose exchange rate is the largest.\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"def equal_to_max(df):\n",
" '''returns a boolean array denoting if the exchange\n",
" rate of a row of `df` is equal to the max exchange rate.'''\n",
" return df['currency'] == df['currency'].max()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" currency | \n",
" exchange | \n",
" continent | \n",
"
\n",
" \n",
" \n",
" \n",
" MEX | \n",
" peso | \n",
" 19.64 | \n",
" North America | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" currency exchange continent\n",
"MEX peso 19.64 North America"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(\n",
" currencies\n",
" .loc[ends_in_o]\n",
" .loc[equal_to_max]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Remark:* This cannot be done by passing a boolean index directly, as the logic in step 2 refers to the output of step 1, which doesn't have a name! While this example is contrived, this constraint commonly appears when adding and modifying columns of a DataFrame using method chaining."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Selecting subtables using `loc`\n",
"\n",
"The `loc` selector allows simultaneous selection of rows and columns via matrix notation. That is, given a dataframe `df`, a list of indexes `idx`, and column labels `cols`, the expression `df.loc[idx, cols]` evaluates to the dataframe with rows corresponding to the index `idx` and columns corresponding to the columns in `cols`.\n",
"\n",
"For example, the currency and exchange rate for MEX and JPN is given by:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" currency | \n",
" exchange | \n",
"
\n",
" \n",
" \n",
" \n",
" MEX | \n",
" peso | \n",
" 19.64 | \n",
"
\n",
" \n",
" JPN | \n",
" yen | \n",
" 105.94 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" currency exchange\n",
"MEX peso 19.64\n",
"JPN yen 105.94"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"countries = ['MEX', 'JPN']\n",
"attributes = ['currency', 'exchange']\n",
"currencies.loc[countries, attributes]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similarly `loc` can also take *pairs* of boolean arrays corresponding rows and columns. Below selects the first and last row and the middle column using boolean arrays:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" exchange | \n",
"
\n",
" \n",
" \n",
" \n",
" USA | \n",
" 1.00 | \n",
"
\n",
" \n",
" CHN | \n",
" 7.02 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" exchange\n",
"USA 1.00\n",
"CHN 7.02"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"currencies.loc[\n",
" [True, False, False, False, True], \n",
" [False, True, False]\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The slicing operator `:` selects all rows and/or columns when passed into `loc`:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" currency | \n",
" exchange | \n",
"
\n",
" \n",
" \n",
" \n",
" USA | \n",
" usd | \n",
" 1.000 | \n",
"
\n",
" \n",
" JPN | \n",
" yen | \n",
" 105.940 | \n",
"
\n",
" \n",
" EU | \n",
" euro | \n",
" 0.897 | \n",
"
\n",
" \n",
" MEX | \n",
" peso | \n",
" 19.640 | \n",
"
\n",
" \n",
" CHN | \n",
" renminbi | \n",
" 7.020 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" currency exchange\n",
"USA usd 1.000\n",
"JPN yen 105.940\n",
"EU euro 0.897\n",
"MEX peso 19.640\n",
"CHN renminbi 7.020"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"currencies.loc[:, attributes]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Beware: strange behaviors with row and column selection\n",
"* What if your boolean arrays have incorrect length?\n",
"* What if your indexes/columns are of boolean type?\n",
"* `[]` has the same problems!\n",
"* Duplicate column names / duplicate selection?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}