2.1. Introduction to Tabular Data

The Tabular Data Model

Learning outcomes

  • Understand the tabular data model.

  • Describe the contents of simple tables.

Introduction

Investigations into phenomena require taking measurements of events or individuals in question. These two primary components of a study are typically kept in tabular data structures. The rows of a table consist of observations (e.g. individuals or events), whereas the columns of a table contain the measurements made on each individual.

For example, the table below consists of the answers to survey questions sent to a group of people. Each person (individual) provides answers to the questions posed (the measurements).

Respondent Id

Name

Job Title

Favorite Color

Age

120

Alice

CEO

Green

47

121

Bob

Store Manager

Yellow

35

122

Cai

Writer

Orange

38

129

Daniela

Engineer

Pink

26

The anatomy of a table

Tables are two-dimensional data structures holding observations; each observation is composed of different measurements or attributes.

  • A row of a table consists of an individual observation. As a table may contain a wide variety of attributes, a given row may describe a single individual in a variety of ways.

  • A column of a table consists of the collection of a single measurement for all observations. A column consists of data of homogeneous scale on which statistics are calculated. Columns are often called variables or attributes as well.

  • Tables have labels that identify their rows and columns:

    • The row labels are referred to as the index of the table.

    • The column labels are referred to as the column names of the table.

Anatomy of a Table

Example: In the table above, the red column is the favorite color variable, the yellow row contains the survey response for respondent 129, the blue portions of the table contain the index and column labels.

Tables in python, using Pandas

Python’s most popular library for table manipulation is called Pandas. The name derives from the term ‘Panel Data’. Pandas is a large, active library used by a large community of pythond developers engaged in data manipulation. Learning to use this library comes with both the advantages and frustrations of using any active code-base:

The advantages:

  • it has many useful features that makes data manipulation quick and easy,

  • many of Pandas’ features are C-optimized, making it far more performant than pure python code

The disadvantages:

  • the library evolved organically, yielding an interface that is sometimes awkward and unintuitive,

  • much of the library is under active development, so don’t be surprised if newer versions behaving differently!

Pandas typically has many ways of coding the same logic using a wide variety of functions and methods. These notes will restrict usage to a minimal set of the Pandas library that lend themselves toward the general patterns of tabular computing, applicable beyond Pandas (e.g. in SQL, R, Spark).

While the reader is encouraged to explore the myriad functions and methods of the Pandas API, they are also encouraged to attempt to implement solutions using the smaller set of functions considered in these notes. Regardless, the reader is encouraged to explore the Pandas documentation (which is quite good!) and understand the possible inputs and outputs of each function they use.

To begin using Pandas, import the library as follows:

import pandas as pd

Creating DataFrame objects

Depending on the perspective, tables can be considered either a collection of rows (observations) or columns (measurements). As such, a table is constructed by either one of these structures.

To create a table in Pandas, pass either of the following to the DataFrame constructor:

  1. an ordered collection of observations representing the rows of a table (e.g. a list-of-lists), or

  2. an collection of measurements representing the columns of the table (e.g. a dictionary of lists, keyed by column name).

Example: Defining a table via a collection of observations (rows):

columns = ['Name', 'Job Title', 'Favorite Color', 'Age']

data = [
    ['Alice', 'CEO', 'Green', 47],
    ['Bob', 'Store Manager', 'Yellow', 35],
    ['Cai', 'Writer', 'Orange', 38],
    ['Daniela', 'Engineer', 'Pink', 26],
    ['Erykah', 'Marketing', 'Blue', 24]
]

survey = pd.DataFrame(data, columns=columns)
survey
Name Job Title Favorite Color Age
0 Alice CEO Green 47
1 Bob Store Manager Yellow 35
2 Cai Writer Orange 38
3 Daniela Engineer Pink 26
4 Erykah Marketing Blue 24

Example: Defining a DataFrame via a named collection of attributes (columns):

column_dict = {
    'Name': ['Alice', 'Bob', 'Cai', 'Daniela', 'Erykah'],
    'Job Title': ['CEO', 'Store Manager', 'Writer', 'Engineer', 'Marketing'],
    'Favorite Color': ['Green', 'Yellow', 'Orange', 'Pink', 'Blue'],
    'Age': [47, 35, 38, 26, 24]
}

survey = pd.DataFrame(column_dict)
survey
Name Job Title Favorite Color Age
0 Alice CEO Green 47
1 Bob Store Manager Yellow 35
2 Cai Writer Orange 38
3 Daniela Engineer Pink 26
4 Erykah Marketing Blue 24

Reading tables into Pandas from external sources

  • read_csv is the primary Pandas function used to read DataFrames from csv files on disk.

  • Other Pandas functions that read tables from other sources include:

    • read_excel parses Excel workbooks into dictionaries of DataFrames keyed by worksheet name.

    • read_html parses tables on a given html page into a list of DataFrames.

    • read_sql reads tabular data from a database using structured query language.

    • read_json parses JSON record files into a DataFrame.

Remark: These parsing functions have copious options. See the Pandas documentation for details.

pd.read_csv('survey.csv')
Name Job Title Favorite Color Age
0 Alice CEO Green 47
1 Bob Store Manager Yellow 35
2 Cai Writer Orange 38
3 Daniela Engineer Pink 26
4 Erykah Marketing Blue 24

Writing a DataFrame to a csv file is done using the DataFrame method to_csv.

The anatomy of a DataFrame

Pandas tabular data structure is called a DataFrame; they are built upon two-dimensional numpy arrays. The data in these tables are accessed using labels (e.g. columns and indices) and manipulated with a large library of useful table methods.

The defining table properties given in the section above correspond to the following DataFrame properties, each of which will be reviewed in more detail below.

  1. The observations of a DataFrame are labeled by the index.

  2. The columns of a DataFrame are labeled by columns.

  3. A column of a DataFrame is a one dimensional array index by the index of the table (called a Series).

  4. A row of a DataFrame is a one dimensional array indexed by the columns of the table (called a Series).

Basic Properties

A number of basic properties of DataFrame are stored as attributes.

The number of rows and number of columns of a DataFrame are returned as a tuple by the property shape. Note that the survey table has 5 rows and 4 columns.

survey.shape
(5, 4)

The index and columns of a DataFrame are accessed using attributes of the same name. A few relevant observations of tabular labels in Pandas:

  1. The index and columns of a DataFrame are both Pandas index objects,

  2. index objects are immutable numpy arrays,

  3. index/column attributes can be reassigned,

  4. while DataFrame indices default to row-number, an index can generally contain anything.

survey.columns
Index(['Name', 'Job Title', 'Favorite Color', 'Age'], dtype='object')
survey.index
RangeIndex(start=0, stop=5, step=1)
survey.index = 'a b c d e'.split()
survey
Name Job Title Favorite Color Age
a Alice CEO Green 47
b Bob Store Manager Yellow 35
c Cai Writer Orange 38
d Daniela Engineer Pink 26
e Erykah Marketing Blue 24

Row and column elements: Series objects

The rows and columns of a DataFrame are represented using a Pandas.Series, a class of 1-dimensional arrays. Series are 1-dimensional numpy arrays with labels accessed through an index attribute.

  1. A container of Series objects may be passed to the DataFrame constructor to create a table.

  2. Selecting a single row/column of a DataFrame returns a Series.

The Series constructor accepts a 1-dimensional container of data, along with an optional index:

ser = pd.Series([2,4,6,8], index=['a', 'b', 'c', 'd'])
ser
a    2
b    4
c    6
d    8
dtype: int64
ser.index
Index(['a', 'b', 'c', 'd'], dtype='object')
ser.shape
(4,)

Creating a DataFrame using a collection of Series is similar to any container; notice the columns are passed into the Series constructors as opposed to the DataFrame constructor!

columns = ['Name', 'Job Title', 'Favorite Color', 'Age']

data = [
    pd.Series(['Alice', 'CEO', 'Green', 47], index=columns),
    pd.Series(['Bob', 'Store Manager', 'Yellow', 35], index=columns),
    pd.Series(['Cai', 'Writer', 'Orange', 38], index=columns),
    pd.Series(['Daniela', 'Engineer', 'Pink', 26], index=columns),
    pd.Series(['Erykah', 'Marketing', 'Blue', 24], index=columns)
]

survey = pd.DataFrame(data, index='a b c d e'.split())
survey
Name Job Title Favorite Color Age
a Alice CEO Green 47
b Bob Store Manager Yellow 35
c Cai Writer Orange 38
d Daniela Engineer Pink 26
e Erykah Marketing Blue 24

Selecting rows with loc and columns with []

The most basic task of data selection from a table is extracting a row or column. Given a Pandas.DataFrame named df,

  1. df.loc[idx] selects the row indexed by the value idx,

  2. df[col] selects the column labeled by the value col. Each selection results in a Pandas.Series object indexed by the columns/index of df respectively.

Note 1: df.loc is not a DataFrame method, but rather a Pandas indexer object. As such, it uses square brackets.

Note 2: DataFrame objects are column-oriented by default; most tabular computations (e.g. statistics) are done on columns. As such, the list-like __getitem__ syntax takes in a column name and returns a column.

Example: Selecting observation a using loc results in a Series indexed by the columns of survey.

survey.loc['a']
Name              Alice
Job Title           CEO
Favorite Color    Green
Age                  47
Name: a, dtype: object

Example: Selecting the Name column using [] results in a Series indexed by the index of survey.

survey['Name']
a      Alice
b        Bob
c        Cai
d    Daniela
e     Erykah
Name: Name, dtype: object