{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "pd.set_option('display.max_rows', 7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Describing Different Kinds of Data\n", "---\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to understand and *describe* a dataset statistically, the observations need to be measured in a quantifiable way. However, the attributes of a dataset vary drastically based on the nature of what is being measured; datasets are often a mixture of numbers, labels, and language-based descriptions. \n", "\n", "Specifying the *kind of data* contained in an attribute helps define strategies to quantify and describe the population in terms of the attribute." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example:** The dataset below contains information on Health Department inspections for restaurants in San Francisco. Each row describes a different inspection of a restaurant in the city." ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>business_name</th><th>business_postal_code</th><th>inspection_date</th><th>month</th><th>day</th><th>inspection_score</th><th>risk_category</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Sushirrito</td><td>94111</td><td>2019-03-01</td><td>3</td><td>1</td><td>86.0</td><td>Low Risk</td></tr>\n",
"    <tr><th>1</th><td>Swensen's of SF Inc</td><td>94109</td><td>2018-02-13</td><td>2</td><td>13</td><td>96.0</td><td>Low Risk</td></tr>\n",
"    <tr><th>2</th><td>Vinyl Cafe and Wine Bar</td><td>94117</td><td>2017-01-10</td><td>1</td><td>10</td><td>77.0</td><td>High Risk</td></tr>\n",
"    <tr><th>3</th><td>Andrea's Bakery</td><td>94112</td><td>2017-10-25</td><td>10</td><td>25</td><td>65.0</td><td>Moderate Risk</td></tr>\n",
"    <tr><th>4</th><td>MORNING DUE</td><td>94104</td><td>2018-08-09</td><td>8</td><td>9</td><td>86.0</td><td>Low Risk</td></tr>\n",
"  </tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " business_name business_postal_code inspection_date month day \\\n", "0 Sushirrito 94111 2019-03-01 3 1 \n", "1 Swensen's of SF Inc 94109 2018-02-13 2 13 \n", "2 Vinyl Cafe and Wine Bar 94117 2017-01-10 1 10 \n", "3 Andrea's Bakery 94112 2017-10-25 10 25 \n", "4 MORNING DUE 94104 2018-08-09 8 9 \n", "\n", " inspection_score risk_category \n", "0 86.0 Low Risk \n", "1 96.0 Low Risk \n", "2 77.0 High Risk \n", "3 65.0 Moderate Risk \n", "4 86.0 Low Risk " ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inspections = pd.read_csv('data/inspections.csv')\n", "inspections.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Broadly, there are three different kinds of attributes in the health inspections dataset. Examples of these three types are given below:\n", "\n", "1. The `inspection_score` is a numerical column; calculating mathematical quantities like sums and averages on this column makes sense and yields useful descriptions of the population of health inspections.\n", "1. The `inspection_date` column(s), while composed of numbers, are used as a way to *order events*. That they are represented by numbers is coincidental (e.g. one could represent month \"12\" as \"Dec\"), and computing statistics on these numbers often doesn't make sense.\n", "1. The `business_name` column is not naturally represented by numbers, and there is no clear way to do so. Additionally, this field has no inherent ordering.\n", "\n", "Understanding the dataset requires understanding all of these attributes, and the appropriate strategy for describing a field depends on the *kind* of data the attribute represents." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Kinds of data\n", "\n", "Attributes of a dataset generally fall into one of three types:\n", "1. An attribute is **quantitative** if its values are numeric and standard mathematical operations (e.g. 
mean, sum, ratios) on those values make sense.\n", "1. An attribute is **ordinal** if its values have an ordering from smallest to largest. Equivalently, there is a one-to-one correspondence between the values and a subset of the number line, *for which the order in the data is reflected in the ordering of the number line*.\n", "1. An attribute is **nominal** if its values are differentiated only by their label; they are neither quantitative nor ordinal.\n", "\n", "An attribute is referred to as **categorical** if it is either ordinal or nominal.\n", "\n", "*Remark:* The classification of attributes into these types is not strict; the types merely serve to clarify how to view, process, and analyze an attribute. The classification of an attribute into a certain kind may depend on both *the dataset being considered* and *the question being asked*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example:** In the inspection dataset:\n", "* The `inspection_score` attribute is *quantitative*; for example, one could calculate the average inspection score.\n", "* The `inspection_date` attribute is *ordinal*, ordering the values from the earliest date to the most recent.\n", "* The `risk_category` attribute is *ordinal*, ordering the values from `Low Risk` to `High Risk`.\n", "* The `business_name` attribute is *nominal*, as there is no obvious ordering of restaurant names.\n", "* The `business_postal_code` attribute is *nominal*, as the integer ordering of zip codes is not obviously meaningful." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example:** Even though the Month attribute is of numeric type, it is *nominal*. For example, 5 and 6 (May and June) are only close to each other in the data when the observations occur in the same year. 
However, there are a few subtleties, illustrated by the following hypothetical situations:\n", "* If the dataset consists of only a single year, then the Month attribute is likely *ordinal*.\n", "* The meaning of \"close to each other in the data\" depends on the question being asked of the data. For example, if the dataset is used to answer whether restaurants fail health inspections more often in summer than in winter, then inspections from May of two different years may be \"closer\" than inspections from May and October of the same year." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Empirical distributions and kinds of data\n", "\n", "The typical starting point in understanding a fixed dataset is to understand the distribution of values of each attribute. \n", "\n", "The **empirical distribution** of an attribute is the distribution of the observed data. That is, it describes the proportion of the whole made up by each value. If an attribute is quantitative (and continuous), then the empirical distribution describes the density of *binned* data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }