Describing Different Kinds of Data
In order to understand and describe a dataset statistically, the observations need to be measured in a quantifiable way. However, the attributes of a dataset vary drastically based on the nature of what is being measured; datasets are often a mixture of numbers, labels, and language-based descriptions.
Specifying the kind of data contained in an attribute helps define strategies to quantify and describe the population in terms of the attribute.
Example: The dataset below contains information on Health Department inspections for restaurants in San Francisco. Each row describes a different inspection of a restaurant in the city.
import pandas as pd
inspections = pd.read_csv('data/inspections.csv')
inspections.head()
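| | business_name | business_postal_code | inspection_date | Month | Day | inspection_score | risk_category |
|---|---|---|---|---|---|---|---|
| 1 | Swensen's of SF Inc | 94109 | 2018-02-13 | 2 | 13 | 96.0 | Low Risk |
| 2 | Vinyl Cafe and Wine Bar | 94117 | 2017-01-10 | 1 | 10 | 77.0 | High Risk |
| 3 | Andrea's Bakery | 94112 | 2017-10-25 | 10 | 25 | 65.0 | Moderate Risk |
| 4 | MORNING DUE | 94104 | 2018-08-09 | 8 | 9 | 86.0 | Low Risk |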
Broadly, there are three different kinds of attributes in the health inspections dataset. Examples of these three types are given below:
inspection_score is a numerical column; calculating mathematical quantities like sums and averages on this column makes sense and yields useful descriptions of the population of health inspections.
The inspection_date column(s), while composed of numbers, are used as a way to order events. That they are represented by numbers is coincidental (e.g., one could represent month “12” as “Dec”), and computing statistics on these numbers often doesn’t make sense.
business_name is not usually represented by numbers, and there is no clear way to do so. Additionally, this field has no inherent ordering.
Understanding the dataset requires understanding all of these attributes, and the appropriate strategy for describing a field depends on the kind of data the attribute represents.
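To make the distinction concrete, here is a minimal sketch that rebuilds a few columns of the four previewed rows by hand (rather than reading the CSV) and averages two of them. The variable names `sample`, `mean_score`, and `mean_zip` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hand-copied values from the four rows shown in the preview.
sample = pd.DataFrame({
    "business_name": ["Swensen's of SF Inc", "Vinyl Cafe and Wine Bar",
                      "Andrea's Bakery", "MORNING DUE"],
    "business_postal_code": [94109, 94117, 94112, 94104],
    "inspection_score": [96.0, 77.0, 65.0, 86.0],
})

# A meaningful summary: the average inspection score of these rows.
mean_score = sample["inspection_score"].mean()

# pandas will also average postal codes without complaint,
# but the result does not describe anything about the restaurants.
mean_zip = sample["business_postal_code"].mean()

print(mean_score)  # 81.0
print(mean_zip)    # 94110.5
```

Both computations run, since both columns are stored as numbers; only the first one answers a sensible question. The storage type of a column does not settle what kind of data it holds.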
Kinds of data
Attributes of a dataset generally fall into one of three types:
An attribute is quantitative if its values are numeric and standard mathematical operations (e.g. mean, sum, ratios) on those values make sense.
An attribute is ordinal if its values have a natural ordering from smallest to greatest. Equivalently, the values can be put in one-to-one correspondence with a subset of the number line so that the order in the data matches the order on the number line.
An attribute is nominal if the values are differentiated by only their label; they are neither quantitative nor ordinal.
An attribute is referred to as categorical if it is either ordinal or nominal.
Remark: The classification of attributes into these types is not strict; it merely serves to clarify how to view, process, and analyze an attribute. The classification of an attribute into a certain kind may depend on both the dataset being considered and the question being asked.
Example: In the inspection dataset,
The inspection_score attribute is quantitative; for example, one could calculate the average inspection score.
The inspection_date attribute is ordinal, with values ordered from the earliest date to the most recent date.
The risk_category attribute is ordinal, ordering the values from Low Risk to High Risk.
The business_name attribute is nominal, as there is no obvious ordering of restaurant names.
The business_postal_code attribute is nominal: although zip codes are written as integers, their integer ordering carries no obvious meaning.
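One way these kinds of data might be encoded in pandas is sketched below, using values hand-copied from the previewed rows. The ordering Low Risk < Moderate Risk < High Risk is an assumption made for this illustration, and the variable names are hypothetical.

```python
import pandas as pd

risk = pd.Series(["Low Risk", "High Risk", "Moderate Risk", "Low Risk"])
dates = pd.Series(["2018-02-13", "2017-01-10", "2017-10-25", "2018-08-09"])
names = pd.Series(["Swensen's of SF Inc", "Vinyl Cafe and Wine Bar",
                   "Andrea's Bakery", "MORNING DUE"])

# Ordinal: an ordered categorical. The label order below is an assumption.
risk_dtype = pd.CategoricalDtype(
    ["Low Risk", "Moderate Risk", "High Risk"], ordered=True
)
risk = risk.astype(risk_dtype)

# Ordinal: dates parsed as timestamps sort from earliest to most recent.
dates = pd.to_datetime(dates)

# Order-based operations make sense for ordinal attributes.
highest_risk = risk.max()
at_least_moderate = int((risk >= "Moderate Risk").sum())
sorted_risk = risk.sort_values().tolist()
earliest = dates.min()

# Nominal: only label-based operations, such as counting distinct names.
distinct_names = names.nunique()

print(highest_risk, at_least_moderate, earliest.date(), distinct_names)
```

With an ordered categorical, comparisons and sorting follow the declared label order rather than alphabetical order, which is what distinguishes an ordinal encoding from plain strings.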
Example: Even though the Month attribute is of numeric type, it is nominal. For instance, the values 5 and 6 (May and June) are only close to each other in the data when the observations occur in the same year. However, there are a few subtleties illustrated by the following hypothetical situations:
If the dataset consists of only a single year, then the Month attribute is likely ordinal.
The meaning of “close to each other in the data” depends on the question being asked of the data. For example, if the dataset is used to answer whether restaurants fail health inspections more often in summer than in winter, then a comparison of the inspections between May of two different years may be “closer” than a comparison of the inspections that occurred between May and October of the same year.
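The gap between the month label and elapsed time can be illustrated with a small sketch. The two helper functions and the example dates below are hypothetical, chosen only to contrast the difference in month numbers with the number of calendar months actually separating two inspections.

```python
from datetime import date

def month_label_gap(d1, d2):
    """Difference between the month numbers alone, ignoring the year."""
    return abs(d2.month - d1.month)

def months_apart(d1, d2):
    """Number of calendar months separating the two dates."""
    return abs((d2.year - d1.year) * 12 + (d2.month - d1.month))

dec_2017, jan_2018 = date(2017, 12, 1), date(2018, 1, 1)
may_2017, may_2018 = date(2017, 5, 1), date(2018, 5, 1)
oct_2018 = date(2018, 10, 1)

# Adjacent months across a year boundary: far apart as labels, close in time.
print(month_label_gap(dec_2017, jan_2018), months_apart(dec_2017, jan_2018))  # 11 1

# Same month in different years: identical labels, a year apart in time.
print(month_label_gap(may_2017, may_2018), months_apart(may_2017, may_2018))  # 0 12

# Within a single year the two notions agree.
print(month_label_gap(may_2018, oct_2018), months_apart(may_2018, oct_2018))  # 5 5
```

For a seasonal question, the label gap of 0 between the two Mays is the relevant notion of closeness; for a question about trends over time, the 12-month separation is. Neither is correct in general, which is why the kind of data depends on the question.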
Empirical distributions and kinds of data
The typical starting point in understanding a fixed dataset is to understand the distribution of values of each attribute.
The Empirical Distribution of an attribute is the distribution of observed data. That is, it describes the proportion of the whole made up by each value. If an attribute is quantitative (and continuous), then the empirical distribution describes the density of binned data.
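As a rough sketch of both cases, the code below computes an empirical distribution for a categorical attribute and for a quantitative one, using values hand-copied from the four previewed rows. The bin edges are an arbitrary choice made for this illustration, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

risk = pd.Series(["Low Risk", "High Risk", "Moderate Risk", "Low Risk"])
scores = np.array([96.0, 77.0, 65.0, 86.0])

# Categorical attribute: proportion of observations taking each value.
risk_dist = risk.value_counts(normalize=True)

# Quantitative attribute: density of binned data.
# The bin edges are an assumption chosen for this example.
edges = np.array([60, 70, 80, 90, 100])
density, _ = np.histogram(scores, bins=edges, density=True)

# Density times bin width gives the proportion of observations per bin.
proportions = density * np.diff(edges)

print(risk_dist)
print(density)
print(proportions.sum())
```

The proportions of a categorical distribution sum to 1, and so does the area under the density histogram, so both describe how the whole is divided among the observed values.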