{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove_cell"
    ]
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Selecting Data\n",
    "---\n",
    "\n",
    "A given table usually represents a set of measurements for a given population. Oftentimes, one's interest lies in either a specific subpopulation of interest (i.e. rows) or a specific subset of measurements (i.e. columns).\n",
    "\n",
    "For example, in the table below, one might want to \n",
    "* restrict attention to those individuals in their thirties,\n",
    "* concern oneself with the name and age of the individuals.\n",
    "\n",
    "Selecting rows and columns based on this criteria results in a smaller table:\n",
    "\n",
    "![selecting a subtable](imgs/selecting-subtables.png)\n",
    "\n",
    "The [previous chapter](02/tabular-data.html#selecting-rows-with-loc-and-columns-with-) covered selection of single rows and columns, using `.loc[]` and `[]` respectively. This section covers general approaches to selecting a subtables of a given table. These approaches include:\n",
    "* selecting explicit subsets of observations (rows) and/or attributes (columns),\n",
    "* selecting subsets of observations and attributes based on conditions using boolean arrays.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selecting explicit subsets of rows and columns\n",
    "\n",
    "### Row selection\n",
    "\n",
    "The dataframe `currencies` contains different currency names, their exchange rate against the US Dollar, and the continent of the given country, indexed by the name of the country manufacturing the currency."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>currency</th>\n",
       "      <th>exchange</th>\n",
       "      <th>continent</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>USA</th>\n",
       "      <td>usd</td>\n",
       "      <td>1.000</td>\n",
       "      <td>North America</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>JPN</th>\n",
       "      <td>yen</td>\n",
       "      <td>105.940</td>\n",
       "      <td>Asia</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>EU</th>\n",
       "      <td>euro</td>\n",
       "      <td>0.897</td>\n",
       "      <td>Europe</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>MEX</th>\n",
       "      <td>peso</td>\n",
       "      <td>19.640</td>\n",
       "      <td>North America</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CHN</th>\n",
       "      <td>renminbi</td>\n",
       "      <td>7.020</td>\n",
       "      <td>Asia</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     currency  exchange      continent\n",
       "USA       usd     1.000  North America\n",
       "JPN       yen   105.940           Asia\n",
       "EU       euro     0.897         Europe\n",
       "MEX      peso    19.640  North America\n",
       "CHN  renminbi     7.020           Asia"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "currencies = pd.DataFrame({\n",
    "    'currency': ['usd', 'yen', 'euro', 'peso', 'renminbi'],\n",
    "    'exchange': [1, 105.94, 0.897, 19.64, 7.02],\n",
    "    'continent': ['North America', 'Asia', 'Europe', 'North America', 'Asia']\n",
    "}, index=['USA', 'JPN', 'EU', 'MEX', 'CHN'])\n",
    "\n",
    "currencies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall that `currencies.loc['MEX']` selects the attributes corresponding to `MEX`; this row is a *one dimensional* Series object. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "currency              peso\n",
       "exchange             19.64\n",
       "continent    North America\n",
       "Name: MEX, dtype: object"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "currencies.loc['MEX']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Selecting a subset of rows results in a potentially *smaller* table. However, such a table is still a *two dimensional* dataframe. To select a subset rows, simply pass a list to `loc`. For example, to create a table from `currencies` consisting of the rows indexed by `JAPAN` and `CHN`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>currency</th>\n",
       "      <th>exchange</th>\n",
       "      <th>continent</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>JPN</th>\n",
       "      <td>yen</td>\n",
       "      <td>105.94</td>\n",
       "      <td>Asia</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CHN</th>\n",
       "      <td>renminbi</td>\n",
       "      <td>7.02</td>\n",
       "      <td>Asia</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     currency  exchange continent\n",
       "JPN       yen    105.94      Asia\n",
       "CHN  renminbi      7.02      Asia"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "currencies.loc[['JPN', 'CHN']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Selecting explicit subsets of rows using `loc`**: Given a dataframe `df` and a subset `idx_list` of the index `df.index`, the dataframe `df.loc[idx_list]` consists of the rows of `df` with index given by `idx_list`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Example:** If the index list consists of one a single index, the resulting object is still a two dimensional dataframe, consisting of a single row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>currency</th>\n",
       "      <th>exchange</th>\n",
       "      <th>continent</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>USA</th>\n",
       "      <td>usd</td>\n",
       "      <td>1.0</td>\n",
       "      <td>North America</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    currency  exchange      continent\n",
       "USA      usd       1.0  North America"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "idx_list = ['USA']\n",
    "currencies.loc[idx_list]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Column Selection\n",
    "\n",
    "Recall that `currencies['exchange']` selects exchange rate column for each country in the table; this column is a one dimensional object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "USA      1.000\n",
       "JPN    105.940\n",
       "EU       0.897\n",
       "MEX     19.640\n",
       "CHN      7.020\n",
       "Name: exchange, dtype: float64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "currencies['exchange']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similar to row selection, passing a list of column names produces a dataframe with the columns given in the list. For example, to select only the 'currency' and 'exchange' columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>currency</th>\n",
       "      <th>exchange</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>USA</th>\n",
       "      <td>usd</td>\n",
       "      <td>1.000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>JPN</th>\n",
       "      <td>yen</td>\n",
       "      <td>105.940</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>EU</th>\n",
       "      <td>euro</td>\n",
       "      <td>0.897</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>MEX</th>\n",
       "      <td>peso</td>\n",
       "      <td>19.640</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CHN</th>\n",
       "      <td>renminbi</td>\n",
       "      <td>7.020</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     currency  exchange\n",
       "USA       usd     1.000\n",
       "JPN       yen   105.940\n",
       "EU       euro     0.897\n",
       "MEX      peso    19.640\n",
       "CHN  renminbi     7.020"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "currencies[['currency', 'exchange']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Selecting explicit subsets of columns using `[]`**: Given a dataframe `df` and a subset `cols` of the columns `df.columns`, the dataframe `df[cols]` consists of the rows of `df` with columns given by the columns in `cols`.\n",
    "\n",
    "**Example:** This method of column selection offers a convenient way of explicitly reordering the columns of a table: pass the full column list in the desired order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>continent</th>\n",
       "      <th>currency</th>\n",
       "      <th>exchange</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>USA</th>\n",
       "      <td>North America</td>\n",
       "      <td>usd</td>\n",
       "      <td>1.000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>JPN</th>\n",
       "      <td>Asia</td>\n",
       "      <td>yen</td>\n",
       "      <td>105.940</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>EU</th>\n",
       "      <td>Europe</td>\n",
       "      <td>euro</td>\n",
       "      <td>0.897</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>MEX</th>\n",
       "      <td>North America</td>\n",
       "      <td>peso</td>\n",
       "      <td>19.640</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CHN</th>\n",
       "      <td>Asia</td>\n",
       "      <td>renminbi</td>\n",
       "      <td>7.020</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         continent  currency  exchange\n",
       "USA  North America       usd     1.000\n",
       "JPN           Asia       yen   105.940\n",
       "EU          Europe      euro     0.897\n",
       "MEX  North America      peso    19.640\n",
       "CHN           Asia  renminbi     7.020"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cols = ['continent', 'currency', 'exchange']\n",
    "currencies[cols]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selection of subtables via conditions\n",
    "\n",
    "Most often, selection of observations and attributes occur via applying some relevant criteria. For example, in a table of survey responses, one may only want to consider:\n",
    "* responses from respondents of a certain age (row selection) \n",
    "* answers to questions that have a 100% response rate (column selection).\n",
    "\n",
    "Such selections occur in two steps:\n",
    "1. create a boolean index using a vector comparison that captures the selection logic,\n",
    "2. pass the boolean index to the Pandas row/column selector. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Boolean indexing\n",
    "\n",
    "Boolean indexes are boolean arrays that represent whether or not a condition is met for a given position in an index. Such arrays are created using logical operators on array objects. For example, a boolean index for those countries whose exchange rate is create than one is given by:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "USA    False\n",
       "JPN     True\n",
       "EU     False\n",
       "MEX     True\n",
       "CHN     True\n",
       "Name: exchange, dtype: bool"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "currencies['exchange'] > 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A boolean index that reads `True` if a country is either in Asia, or has an exchange rate less than 1, is given by:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "USA    False\n",
       "JPN     True\n",
       "EU      True\n",
       "MEX    False\n",
       "CHN     True\n",
       "dtype: bool"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(currencies['continent'] == 'Asia') | (currencies['exchange'] < 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Selecting rows using boolean indexes\n",
    "\n",
    "Rows of a Pandas dataframe can be selected by passing a boolean index to `loc`. For example, one can create a dataframe consisting of the rows containing JPN and MEX using a boolean array with `True` only in the second and fourth positions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>currency</th>\n",
       "      <th>exchange</th>\n",
       "      <th>continent</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>JPN</th>\n",
       "      <td>yen</td>\n",
       "      <td>105.94</td>\n",
       "      <td>Asia</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>MEX</th>\n",
       "      <td>peso</td>\n",
       "      <td>19.64</td>\n",
       "      <td>North America</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    currency  exchange      continent\n",
       "JPN      yen    105.94           Asia\n",
       "MEX     peso     19.64  North America"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bool_arr = [False, True, False, True, False]\n",
    "currencies.loc[bool_arr]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The subtable consisting of countries either in Asia or with an exchange rate less than 1 is obtained via:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>currency</th>\n",
       "      <th>exchange</th>\n",
       "      <th>continent</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>JPN</th>\n",
       "      <td>yen</td>\n",
       "      <td>105.940</td>\n",
       "      <td>Asia</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>EU</th>\n",
       "      <td>euro</td>\n",
       "      <td>0.897</td>\n",
       "      <td>Europe</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CHN</th>\n",
       "      <td>renminbi</td>\n",
       "      <td>7.020</td>\n",
       "      <td>Asia</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     currency  exchange continent\n",
       "JPN       yen   105.940      Asia\n",
       "EU       euro     0.897    Europe\n",
       "CHN  renminbi     7.020      Asia"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "asia_or_exch_less_1 = (currencies['continent'] == 'Asia') | (currencies['exchange'] < 1)\n",
    "currencies.loc[asia_or_exch_less_1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Selecting rows using boolean indexes:** Suppose `df` is a dataframe and `bool_arr` is a boolean array of the same length of `df`. Then `df.loc[bool_arr]` is a dataframe whose rows are the rows of `df` for which the corresponding position in `bool_arr` is `True`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Selecting rows using functions\n",
    "\n",
    "Rows of a Pandas DataFrame can also be selected by passing a *function* to loc:\n",
    "* the function takes in a DataFrame and returns a boolean array;\n",
    "* this boolean array (applied to the DataFrame at hand) is then passed to loc to select the rows as outlined above.\n",
    "\n",
    "**Example:** To select the rows in `currencies` whose currency begins with the letter 'o':"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>currency</th>\n",
       "      <th>exchange</th>\n",
       "      <th>continent</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>EU</th>\n",
       "      <td>euro</td>\n",
       "      <td>0.897</td>\n",
       "      <td>Europe</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>MEX</th>\n",
       "      <td>peso</td>\n",
       "      <td>19.640</td>\n",
       "      <td>North America</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    currency  exchange      continent\n",
       "EU      euro     0.897         Europe\n",
       "MEX     peso    19.640  North America"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def ends_in_o(df):\n",
    "    '''returns a boolean array representing\n",
    "    whether each row in the currency \n",
    "    column of `df` ends in  the letter o.'''\n",
    "    return df['currency'].str.endswith('o')\n",
    "\n",
    "currencies.loc[ends_in_o]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**selecting rows using function:** Suppose `df` is a DataFrame and `f` is a function that takes in a DataFrame and returns a boolean array. Then `df.loc[f]` returns the same DataFrame as `df.loc[f(df)]`.\n",
    "\n",
    "This technique is useful when using method chaining, as the function generalizes the selection logic without referencing a specific DataFrame.\n",
    "\n",
    "**Example:** Without using `sort_values`/`drop_duplicates`, and using method-chaining, return a DataFrame containing the country (or countries) with the largest exchange rate among the currencies that end with the letter 'o'.\n",
    "\n",
    "This requires two steps:\n",
    "1. select the countries whose currency begins with the letter 'o',\n",
    "1. select the countries in the DataFrame from step 1 whose exchange rate is the largest.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "def equal_to_max(df):\n",
    "    '''returns a boolean array denoting if the exchange\n",
    "    rate of a row of `df` is equal to the max exchange rate.'''\n",
    "    return df['currency'] == df['currency'].max()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>currency</th>\n",
       "      <th>exchange</th>\n",
       "      <th>continent</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>MEX</th>\n",
       "      <td>peso</td>\n",
       "      <td>19.64</td>\n",
       "      <td>North America</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    currency  exchange      continent\n",
       "MEX     peso     19.64  North America"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(\n",
    "    currencies\n",
    "    .loc[ends_in_o]\n",
    "    .loc[equal_to_max]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Remark:* This cannot be done by passing a boolean index directly, as the logic in step 2 refers to the output of step 1, which doesn't have a name! While this example is contrived, this constraint commonly appears when adding and modifying columns of a DataFrame using method chaining."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selecting subtables using `loc`\n",
    "\n",
    "The `loc` selector allows simultaneous selection of rows and columns via matrix notation. That is, given a dataframe `df`, a list of indexes `idx`, and column labels `cols`, the expression `df.loc[idx, cols]` evaluates to the dataframe with rows corresponding to the index `idx` and columns corresponding to the columns in `cols`.\n",
    "\n",
    "For example, the currency and exchange rate for MEX and JPN is given by:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>currency</th>\n",
       "      <th>exchange</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>MEX</th>\n",
       "      <td>peso</td>\n",
       "      <td>19.64</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>JPN</th>\n",
       "      <td>yen</td>\n",
       "      <td>105.94</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    currency  exchange\n",
       "MEX     peso     19.64\n",
       "JPN      yen    105.94"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "countries = ['MEX', 'JPN']\n",
    "attributes = ['currency', 'exchange']\n",
    "currencies.loc[countries, attributes]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly `loc` can also take *pairs* of boolean arrays corresponding rows and columns. Below selects the first and last row and the middle column using boolean arrays:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>exchange</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>USA</th>\n",
       "      <td>1.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CHN</th>\n",
       "      <td>7.02</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     exchange\n",
       "USA      1.00\n",
       "CHN      7.02"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "currencies.loc[\n",
    "    [True, False, False, False, True], \n",
    "    [False, True, False]\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The slicing operator `:` selects all rows and/or columns when passed into `loc`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>currency</th>\n",
       "      <th>exchange</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>USA</th>\n",
       "      <td>usd</td>\n",
       "      <td>1.000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>JPN</th>\n",
       "      <td>yen</td>\n",
       "      <td>105.940</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>EU</th>\n",
       "      <td>euro</td>\n",
       "      <td>0.897</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>MEX</th>\n",
       "      <td>peso</td>\n",
       "      <td>19.640</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CHN</th>\n",
       "      <td>renminbi</td>\n",
       "      <td>7.020</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     currency  exchange\n",
       "USA       usd     1.000\n",
       "JPN       yen   105.940\n",
       "EU       euro     0.897\n",
       "MEX      peso    19.640\n",
       "CHN  renminbi     7.020"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "currencies.loc[:, attributes]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Beware: strange behaviors with row and column selection\n",
    "* What if your boolean arrays have incorrect length?\n",
    "* What if your indexes/columns are of boolean type?\n",
    "* `[]` has the same problems!\n",
    "* Duplicate column names / duplicate selection?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}