Pandas Data Types and Performance Considerations


While the rest of the chapter emphasized learning to work with tabular data in Pandas, this section focuses on the details of how Pandas implements these data structures. Understanding these details is necessary when using Pandas as the primary library for tabular manipulation.

While Pandas performs many computations very quickly, the library is not designed primarily for speed. Instead, Pandas optimizes for developer ease: writing Pandas code should be easy, fast, and require little overhead. Pandas embraces the Donald Knuth truism that:

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.

Pandas emphasizes faster-to-code yet slower-to-run design patterns under the assumptions that:

  • The data scientist performs analyses iteratively; at the start there is no clear computational goal.

  • What may be slow to run on a single laptop may distribute well to a large cluster of servers using a distributed framework like Spark or Dask.

  • Datasets are often small and easily handled with (somewhat) inefficient code (and a sample of the data may help start the analysis).

However, it’s equally important to know the drawbacks and strengths of the library, as it often becomes necessary to push Pandas to the limits of what it can handle. In particular,

  • Resource constraints (e.g. a small laptop) shouldn’t make a complete analysis of a large dataset impossible.

  • Moving from Pandas to a more performant library involves significant developer time. The larger the dataset one can process with Pandas, the faster one can do the work.

Pandas is built upon Numpy

The C-optimized code powering Pandas comes from the Numpy library (short for ‘Numerical Python’). A Pandas DataFrame is a columnar collection of Numpy arrays, and thus many built-in DataFrame methods are fast Numpy methods applied across the column(s) of a DataFrame.
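As a quick illustration (a minimal sketch with a small, hypothetical DataFrame df), each column of a DataFrame is backed by a Numpy array, and column arithmetic dispatches to vectorized Numpy code:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [0.5, 1.5, 2.5]})
type(df['a'].to_numpy())    # numpy.ndarray: each column is backed by a Numpy array
df['a'] + df['b']           # column arithmetic runs as a vectorized Numpy operation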

Pandas is written to take advantage of Numpy's performance, which comes from vectorized code. Vectorized execution relies on knowing:

  1. the plan of execution ahead of time.

  2. the type of possible inputs and outputs that methods will use.

  3. the size of possible inputs and outputs.

Tip: These needs translate to a few practical habits that lead to faster and more efficient Pandas code:

  1. Never loop over the rows of a DataFrame (operations over columns are vectorized by Numpy array calculations).

  2. Use built-in DataFrame methods on columns, rather than custom python functions, whenever possible (the built-in methods call C-optimized Numpy code).

  3. Explicitly type the data if memory is an issue (more on that later!).

Example: Use the %timeit magic in Jupyter Notebooks to assess the difference in execution time. The DataFrame below contains one million observations of five attributes, each value an integer between 0 and 9. Using %timeit, one compares the execution time of taking the average of each column via:

  • the built-in Numpy method,

  • a ‘user-defined’ average function, applied using the apply method,

  • looping through the rows and columns of the data to compute the mean in pure python.

data = pd.DataFrame(np.random.randint(0,10,size=(10**6,5)), columns='a b c d e'.split())
data
a b c d e
0 0 0 7 4 6
1 5 9 4 1 6
2 3 1 5 3 8
... ... ... ... ... ...
999997 1 4 7 6 2
999998 2 0 4 9 1
999999 6 7 5 1 5

1000000 rows × 5 columns

%timeit data.mean()
17 ms ± 962 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit data.apply(lambda x:sum(x) / len(x))
601 ms ± 24.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit 

means = []
for c in data.columns:
    s = 0
    for x in data[c]:
        s += x
    means.append(s / len(data[c]))
857 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Notice that the built-in mean method is faster by a factor of roughly 30!

Remark: The %timeit magic times a single line of code, while %%timeit times the execution of an entire cell.

Example: The DataFrame method DataFrame.info() displays the column types, along with memory consumption:

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 5 columns):
a    1000000 non-null int64
b    1000000 non-null int64
c    1000000 non-null int64
d    1000000 non-null int64
e    1000000 non-null int64
dtypes: int64(5)
memory usage: 38.1 MB

Data Types

In Pandas, a data type is a classification that specifies the type of a column's values. Understanding data types in Pandas leads to cleaner, better-optimized (in both space and time), less error-prone code. In particular, Pandas' handling of data types lends itself to the creation of hard-to-spot computational errors.

The data types Pandas stores in its DataFrames are roughly the same as the Numpy data types. However, Pandas diverges from Numpy in a few ways:

  • Pandas infers data types from pure python types, or from the types on disk (and is sometimes wrong!).

  • Pandas allocates large amounts of memory for data types by default (it prioritizes correctness over efficiency).

  • A column’s data type determines which operations can be applied to it:

    • Numpy arrays are by default of homogeneous data type.

    • Pandas DataFrames are heterogeneous, column-oriented tables. The columns themselves are homogeneous, which means column methods are fast.

  • Pandas makes heavy use of the object data-type, which contains generic ‘object’ values that may be of mixed type. Performing operations on these columns is slow, and their behavior can be surprising (see the sketch below).
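A minimal sketch of such a surprise: a hypothetical column of numbers read in as strings is stored with the object dtype, and summing it silently concatenates rather than adds:

nums_as_text = pd.Series(['1', '2', '3'])    # dtype: object
nums_as_text.sum()                           # '123' -- string concatenation, not 6
nums_as_text.astype(int).sum()               # 6, after converting to an integer dtype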

The DataFrame attribute dtypes returns a Series of data-types of each column, indexed by column name:

data.dtypes
a    int64
b    int64
c    int64
d    int64
e    int64
dtype: object

The table below contains a list of Pandas data types and their equivalents in other common scenarios:

Pandas dtype    Python type   NumPy type                                SQL type      Usage
object          NA            object                                    NA            Mixed types
object          str           string, unicode                           NA            Text
int64           int           int_, int8,…,int64, uint8,…,uint64        INT, BIGINT   Integer numbers
float64         float         float_, float16, float32, float64         FLOAT         Floating point numbers
bool            bool          bool_                                     BOOL          True/False values
datetime64      datetime      datetime64[ns]                            DATETIME      Date and time values
timedelta[ns]   timedelta     NA                                        NA            Differences between two datetimes
category        NA            NA                                        ENUM          Finite list of text values

Remark: Numpy improves performance by explicitly controlling the precision of the values contained in an array. While great for speed and space, these options are a hassle to constantly specify; Pandas always defaults to 64 bits.
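For instance, since the columns of data above only hold values between 0 and 9, they comfortably fit in 8-bit integers; explicitly downcasting cuts memory usage to roughly one-eighth of the int64 default. A minimal sketch using astype:

data_small = data.astype('int8')   # downcast every column from the default int64
data_small.info()                  # memory usage drops to roughly one-eighth of before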

Example: Numpy and Pandas follow different conventions for data type inference. Numpy coerces array values to a single homogeneous type, whereas Pandas types each column separately, storing strings (and mixed values) with the generic ‘object’ type.

The data defined below consists of a single observation with two attributes: a single character (string) and a single integer.

  1. By default, Numpy coerces the integer to a string, resulting in an array of type <U1 (unicode string of length 1).

  2. Pandas stores each column in its own array, each of a different type.

  3. The dtype can be specified explicitly when defining the array.

data = [['a', 1]]
np.array(data)
array([['a', '1']], dtype='<U1')
pd.DataFrame(data).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 2 columns):
0    1 non-null object
1    1 non-null int64
dtypes: int64(1), object(1)
memory usage: 96.0+ bytes
np.array(data, dtype=object)
array([['a', 1]], dtype=object)

Representing missing data

Missing data in Pandas is represented by a special value NaN, which stands for ‘Not a Number’. These values are common to many programming languages and share a common specification (IEEE 754 floating point). In particular, NaN values are floating point numbers that behave in peculiar ways, shared with other unusual mathematical objects like \(\infty\). For example, NaN is often the output of methods that must return a value for a zero-division.

A NaN value can be defined by hand using Numpy’s implementation: np.NaN.
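For instance, dividing a Series by zero does not raise an error; it produces \(\infty\) or NaN depending on the numerator. A quick sketch one can check in a notebook:

pd.Series([1, 0]) / 0     # 1/0 gives inf, while 0/0 gives NaN
np.NaN + 1                # arithmetic with NaN propagates NaN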

The behavior of NaN with respect to comparisons is unusual, and is summarized in the following table:

Comparison    Result
NaN >= x      Always False
NaN <= x      Always False
NaN > x       Always False
NaN < x       Always False
NaN == x      Always False
NaN != x      Always True
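These rules are easy to verify directly; the comparisons below follow the floating point specification, so their results are deterministic:

np.NaN > 0, np.NaN < 0, np.NaN == np.NaN, np.NaN != np.NaN
(False, False, False, True)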

Thus, when checking whether a given value is NaN, one should always use a function or method implemented for such comparisons:

list_of_values = [0, 1, np.NaN, 3, 4, np.NaN, 6]
list_of_values
[0, 1, nan, 3, 4, nan, 6]

The following code intends to loop through the elements of the list, stating whether each value is present or missing. Since it uses an == comparison to np.NaN, the code incorrectly states that every element is present!

for k, x in enumerate(list_of_values):
    if x == np.NaN:
        print('index %d is missing' % k)
    else:
        print('index %d is present' % k)
index 0 is present
index 1 is present
index 2 is present
index 3 is present
index 4 is present
index 5 is present
index 6 is present

Using the function pd.isnull returns the correct result:

for k, x in enumerate(list_of_values):
    if pd.isnull(x):
        print('index %d is missing' % k)
    else:
        print('index %d is present' % k)
index 0 is present
index 1 is present
index 2 is missing
index 3 is present
index 4 is present
index 5 is missing
index 6 is present

The Pandas Series and DataFrame method isnull() returns a boolean array that indicates whether a given entry is missing.

pd.Series(list_of_values).isnull()
0    False
1    False
2     True
3    False
4    False
5     True
6    False
dtype: bool

This boolean array, for example, can select all non-missing rows of a DataFrame.
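For instance, negating the boolean array keeps only the non-missing values. A short sketch using the list_of_values defined above:

ser = pd.Series(list_of_values)
ser[~ser.isnull()]        # keeps the entries at indices 0, 1, 3, 4, 6
# equivalently: ser[ser.notnull()] or ser.dropna()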

Remark: There is also a nan value implemented in pure python. While similar to Numpy's version, it differs in one significant way: Numpy's NaN behaves like a single value that always occupies the same location in memory, whereas python's nan values are objects that occupy different locations in memory each time a new one is instantiated. As Numpy's implementation is more performant, default to using np.NaN whenever possible.

The example below illustrates this observation. A python nan value is instantiated by float('nan').

The variables a,b are two different instances of python's nan object, whereas c,d both refer to np.NaN. Notice that all of them have the same string representation:

a, b, c, d = float('nan'), float('nan'), np.NaN, np.NaN
a, b, c, d
(nan, nan, nan, nan)

Using the identity operator is, the observation above becomes clear:

  • A python nan value is the same object as itself (so it occupies the same location in memory),

  • Two different python nan values occupy different locations in memory,

  • The two references to np.NaN point to the same object in memory.

a is a, a is b, c is d
(True, False, True)

Copies and Views in Pandas

The DataFrame attribute values accesses the underlying Numpy array of a DataFrame. Accessing the underlying Numpy data structure allows one to resort to (in-place) Numpy operations when performance becomes an issue. However, exactly how the underlying Numpy array is stored and used by Pandas is unfortunately complicated and unpredictable.

In particular, Pandas stores a DataFrame in two different ways, depending on the context:

  1. Pandas will create a DataFrame from a copy of an array in most situations. For such a DataFrame, the .values attribute, along with all DataFrame methods, also returns a copy.

  2. Pandas will sometimes create a DataFrame as a view (or reference) of an existing array. This results in huge performance gains, at the expense of possible side-effects from in-place modifications. The most common situation for which this occurs is when a DataFrame contains homogeneous data (and thus can store an unmodified multidimensional Numpy array). However, such situations are not guaranteed and should never be depended upon!

Example: The following example illustrates that an in-place modification of a DataFrame works in some cases (homogeneous data) and not others. The table homogeneous consists only of integers, so .values returns a reference to the existing array:

homogeneous = pd.DataFrame([[0,1,2],[3,4,5]], columns='a b c'.split())
homogeneous
a b c
0 0 1 2
1 3 4 5
homogeneous.values
array([[0, 1, 2],
       [3, 4, 5]])

An in-place reassignment of the upper-left value of the array homogeneous.values results in a changed DataFrame:

homogeneous.values[0][0] = 1
homogeneous
a b c
0 1 1 2
1 3 4 5

On the other hand, the table heterogeneous contains a column of strings alongside the same integer columns as before. In this case, .values returns a copy; modifying that copy in-place does not affect the DataFrame!

heterogeneous = pd.DataFrame([[0,1,2,'a'],[3,4,5,'b']], columns='a b c d'.split())
heterogeneous
a b c d
0 0 1 2 a
1 3 4 5 b
heterogeneous.values
array([[0, 1, 2, 'a'],
       [3, 4, 5, 'b']], dtype=object)

An in-place reassignment of the upper-left value of the array heterogeneous.values results in an unchanged DataFrame:

heterogeneous.values[0][0] = 1
heterogeneous
a b c d
0 0 1 2 a
1 3 4 5 b

Method Chaining

Pandas methods usually return copies:

  • This pattern is intentional (a functional programming paradigm): operations have no side-effects.

  • It can create problems when a variable is assigned at each sub-step (e.g. needlessly high memory usage).

  • The fix is to chain methods together:

    • chains are easy to read, understand, and modify step-by-step,

    • they don't litter the namespace with dozens of temporary variable names.

Example: As seen in the previous section, calculating the top goal scorer in each World Cup was a three-step process:

  • sort the table of players descending by the number of goals scored in each tournament (sort_values),

  • keep only the first (i.e. highest) number for each year (drop_duplicates),

  • sort the table by tournament year.

Procedurally, one can implement these steps exactly as outlined in English above:

uswnt_by_goals = uswnt.sort_values(by='Gls', ascending=False)
most_goals_per_year = uswnt_by_goals.drop_duplicates(subset=['Year'])
most_goals_per_year_sorted = most_goals_per_year.sort_values(by='Year')

This approach:

  • is hard to read, due to introducing unnecessary variables at each step,

  • creates a copy of the DataFrame at every step and assigns it to a name; because of these assignments, python does not release the memory until later than necessary.

It is much better to use method chaining, applying each method directly to the output of the previous one. Python's indentation makes the steps easy to read and parse. Moreover, as each intermediate copy is never assigned to a variable, the interpreter knows it can release the memory as soon as the next method executes.

(
    uswnt
    .sort_values(by='Gls', ascending=False)
    .drop_duplicates(subset=['Year'])
    .sort_values(by='Year')
)

Remark: Many Pandas methods have an ‘inplace’ keyword. Surprisingly, this option does not result in a true in-place operation: a copy is still created under the hood and then swapped into the object being modified. As such, the inplace keyword should never be used; it will eventually be removed from the library entirely.
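For example, rather than modifying the table with inplace=True, simply reassign the copy that the method returns:

# Avoid:  uswnt.sort_values(by='Year', inplace=True)
# Prefer: reassign the returned copy
uswnt = uswnt.sort_values(by='Year')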