{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove_cell"
    ]
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "pd.set_option('display.max_rows', 7)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Describing Different Kinds of Data\n",
    "---\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to understand and *describe* a dataset statistically, the observations need to be measured in a quantifiable way. However, the attributes of a dataset vary drastically based on the nature of what is being measured; datasets are often a mixture of numbers, labels, and language-based descriptions. \n",
    "\n",
    "Specifying the *kind of data* contained in an attribute helps define strategies to quantify and describe the population in terms of the attribute."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Example:** The dataset below contains information on Health Department inspections for restaurants in San Francisco. Each row describes a different inspection of a restaurant in the city."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>business_name</th>\n",
       "      <th>business_postal_code</th>\n",
       "      <th>inspection_date</th>\n",
       "      <th>month</th>\n",
       "      <th>day</th>\n",
       "      <th>inspection_score</th>\n",
       "      <th>risk_category</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Sushirrito</td>\n",
       "      <td>94111</td>\n",
       "      <td>2019-03-01</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>86.0</td>\n",
       "      <td>Low Risk</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Swensen's of SF Inc</td>\n",
       "      <td>94109</td>\n",
       "      <td>2018-02-13</td>\n",
       "      <td>2</td>\n",
       "      <td>13</td>\n",
       "      <td>96.0</td>\n",
       "      <td>Low Risk</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Vinyl Cafe and Wine Bar</td>\n",
       "      <td>94117</td>\n",
       "      <td>2017-01-10</td>\n",
       "      <td>1</td>\n",
       "      <td>10</td>\n",
       "      <td>77.0</td>\n",
       "      <td>High Risk</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Andrea's Bakery</td>\n",
       "      <td>94112</td>\n",
       "      <td>2017-10-25</td>\n",
       "      <td>10</td>\n",
       "      <td>25</td>\n",
       "      <td>65.0</td>\n",
       "      <td>Moderate Risk</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>MORNING DUE</td>\n",
       "      <td>94104</td>\n",
       "      <td>2018-08-09</td>\n",
       "      <td>8</td>\n",
       "      <td>9</td>\n",
       "      <td>86.0</td>\n",
       "      <td>Low Risk</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             business_name  business_postal_code inspection_date  month  day  \\\n",
       "0               Sushirrito                 94111      2019-03-01      3    1   \n",
       "1      Swensen's of SF Inc                 94109      2018-02-13      2   13   \n",
       "2  Vinyl Cafe and Wine Bar                 94117      2017-01-10      1   10   \n",
       "3          Andrea's Bakery                 94112      2017-10-25     10   25   \n",
       "4              MORNING DUE                 94104      2018-08-09      8    9   \n",
       "\n",
       "   inspection_score  risk_category  \n",
       "0              86.0       Low Risk  \n",
       "1              96.0       Low Risk  \n",
       "2              77.0      High Risk  \n",
       "3              65.0  Moderate Risk  \n",
       "4              86.0       Low Risk  "
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inspections = pd.read_csv('data/inspections.csv')\n",
    "inspections.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Broadly, there are three different kinds of attributes in the health inspections dataset. Examples of these three types are given below:\n",
    "\n",
    "1. The `inspection_score` is a numerical column; calculating mathematical quantities like sums and averages on this column makes sense and represent useful descriptions of the population of health inspections.\n",
    "1. The `inspection_date` column(s), while composed of numbers, are used as a way to *order events*. That they are represented by numbers is coincidental (e.g. One could represent month \"12\" as \"Dec\") and computing statistics on these numbers often doesn't make sense.\n",
    "1. The `business_name` is not represented in a usual way by numbers and there is no clear way to do so. Additionally, this fields have no inherent ordering.\n",
    "\n",
    "As it is important to understand all of these attributes to understand the dataset; different strategies for describing the fields depends on the *kind* of data the attribute represents."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Kinds of data\n",
    "\n",
    "Attributes of a dataset generally fall into one of three types:\n",
    "1. An attribute is **quantitative** if its values are numeric and standard mathematical operations (e.g. mean, sum, ratios) on those values make sense.\n",
    "1. An attribute is **ordinal** if its values have an ordering from smallest to greatest. Equivalently, there is a one-to-one correspondence between the values and a subset of the number-line, *for which the order in the data is reflected in the ordering of the number-line*.\n",
    "1. An attribute is **nominal** if the values are differentiated by only their label; they are neither quantitative nor ordinal.\n",
    "\n",
    "An attribute is referred to as **categorical** if it is either ordinal or nominal.\n",
    "\n",
    "*Remark:* The classification of attributes into these types are not strict; they merely serve to clarify how to view, process, and analyze an attribute. The classification of an attribute into a certain kind may depend on both *the dataset being considered* as well as *the question being asked*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Example:** In the inspection dataset,\n",
    "* The `inspection_score` attribute is *quantitative*; for example, one could calculate the average inspection score.\n",
    "* The `inspection_date` attribute is ordinal, ordering the values from earliest date to the most recent date.\n",
    "* The `risk_category` attribute is ordinal, ordering the values from `Low Risk` to `High Risk`.\n",
    "* The `business_name` attribute is nominal, as there is no obvious ordering of restaurant names.\n",
    "* The `business_postal_code` attribute is nominal, as there is no obvious ordering of zip-codes using integers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Example:** Even though the Month attribute is of numeric type, it is *nominal*. For example, 5 and 6 (May and June) are only close to each other in the data when the observations occur in the same year. However, there are a few subtleties illustrated by the following hypothetical situations:\n",
    "* If the dataset consists of only a single year, then the Month attribute is likely *ordinal*.\n",
    "* The meaning of \"close to each other in the data\" depends on the question being asked of the data. For example, if the dataset is answering questions on whether restaurant fail health inspections more often in summer than winter, then a comparison of the inspections between May of two different years may be \"closer\" than a comparison of the inspections that occurred between May and October of the same year."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Empirical distributions and kinds of data\n",
    "\n",
    "The typical starting point in understanding a fixed dataset is to understand the distribution of values of each attribute. \n",
    "\n",
    "The **Empirical Distribution** of an attribute is the distribution of observed data. That is, it describes the proportion of the whole made up by each value. If an attribute is quantitative (and continuous), then the empirical distribution describes the density of *binned* data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}