Methods and Descriptive Statistics

Informational Methods

Pandas provides methods that help the user ‘peek at the data’ in different ways (e.g. look at a few rows at a time, count the number of non-null entries, count the number of distinct entries). These methods are particularly useful when the data is too large to look at in its entirety.

Method Name    Description
head           Return the first n entries of a Series
tail           Return the last n entries of a Series
count          Count the number of non-null entries of a Series
nunique        Return the number of unique values of a Series
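
The four methods can be tried on a small hand-made Series before turning to real data. The sketch below uses hypothetical toy values (not taken from any dataset in these notes) with one missing entry, so that the difference between the length, count, and nunique is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one entry is deliberately missing.
positions = pd.Series(['GK', 'MF', 'DF', np.nan, 'MF', 'FW'])

first_two = positions.head(2)      # first 2 entries
last_two = positions.tail(2)       # last 2 entries
n_non_null = positions.count()     # counts non-null entries only
n_distinct = positions.nunique()   # distinct non-null values

print(len(positions), n_non_null, n_distinct)   # prints: 6 5 4
```

The missing entry is included in the length but not in count, and the repeated 'MF' is counted once by nunique.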

Example: The DataFrame named uswnt contains information on all soccer players on the US Women’s national team from 1991 through 2019.

import numpy as np
import pandas as pd

uswnt = pd.read_csv('data/world_cups.csv')
# number of rows / columns
uswnt.shape
(147, 9)
# first 7 entries; players from the 90s
uswnt.head(7)
Player Pos Age Year Apps Starts Min Gls Ast
0 Mary Harvey GK 26 1991 6 6 540 0 0
1 Julie Foudy MF 20 1991 6 6 540 1 2
2 Carla Overbeck DF 23 1991 6 6 507 0 0
3 Carin Jennings-Gabarra FW 26 1991 6 6 491 6 3
4 Michelle Akers MFFW 25 1991 6 6 491 10 1
5 Linda Hamilton DF 22 1991 6 5 507 0 0
6 Mia Hamm MFFW 19 1991 6 5 499 2 1
# last 2 entries; players from 2019
uswnt.tail(2)
Player Pos Age Year Apps Starts Min Gls Ast
145 Allie Long MF 31 2019 1 0 31 0 0
146 Emily Sonnett DF 25 2019 1 0 8 0 0

A look at the Player column:

# Look at the Player column; `head` is also a Series method.
players = uswnt['Player']
players.head()
0               Mary Harvey
1               Julie Foudy
2            Carla Overbeck
3    Carin Jennings-Gabarra
4            Michelle Akers
Name: Player, dtype: object
# 147 entries, none of them null; names repeat across tournaments (76 distinct players)
players.shape
(147,)
players.count()
147
players.nunique()
76
# Top 5: Most goals in a single world-cup tournament; note the index.
uswnt.sort_values(by='Gls', ascending=False).head()
Player Pos Age Year Apps Starts Min Gls Ast
4 Michelle Akers MFFW 25 1991 6 6 491 10 1
132 Alex Morgan FW 29 2019 6 6 490 6 3
3 Carin Jennings-Gabarra FW 26 1991 6 6 491 6 3
110 Carli Lloyd MF 32 2015 7 7 630 6 1
136 Megan Rapinoe FWMF 33 2019 5 5 429 6 2
# Top goal scorer per world cup for USWNT
(
    uswnt
    .sort_values(by='Gls', ascending=False)
    .drop_duplicates(subset=['Year'])
    .sort_values(by='Year')
)
Player Pos Age Year Apps Starts Min Gls Ast
4 Michelle Akers MFFW 25 1991 6 6 491 10 1
19 Kristine Lilly MFFW 23 1995 6 6 518 3 0
41 Tiffeny Milbrett FW 26 1999 6 5 509 3 0
60 Abby Wambach FW 23 2003 6 5 426 3 0
72 Abby Wambach FW 27 2007 6 6 536 6 0
89 Abby Wambach FW 31 2011 6 6 600 4 1
110 Carli Lloyd MF 32 2015 7 7 630 6 1
132 Alex Morgan FW 29 2019 6 6 490 6 3
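
The 'top scorer per year' chain works because drop_duplicates keeps the first row it sees for each value of the subset columns (keep='first' is the default). Sorting by goals first therefore leaves the highest scorer of each year. A minimal sketch on hypothetical toy data, with made-up player names:

```python
import pandas as pd

# Hypothetical toy data standing in for the Player, Year and Gls columns.
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D', 'E'],
    'Year':   [1991, 1991, 1995, 1995, 1995],
    'Gls':    [10, 6, 1, 3, 2],
})

top_per_year = (
    toy
    .sort_values(by='Gls', ascending=False)   # highest scorers first
    .drop_duplicates(subset=['Year'])         # keep the first row per year
    .sort_values(by='Year')                   # restore chronological order
)
print(top_per_year)
```

Note that when two players tie for the most goals in a year, this pattern keeps only one of them.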

Array arithmetic

  • Series support element-wise array arithmetic, just like Numpy arrays.

  • Warning: Series are aligned by their index labels before the operation is applied! (More on this later)
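
To illustrate the warning, the sketch below adds two hypothetical toy Series whose index labels only partly overlap. Pandas matches entries by label, so labels present in only one Series produce NaN, whereas the underlying Numpy arrays are added position by position.

```python
import pandas as pd

# Hypothetical toy Series with partially overlapping index labels.
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

total = a + b                               # aligned by label, not by position
positional = a.to_numpy() + b.to_numpy()    # plain Numpy: position by position

print(total)        # 'y' and 'z' are summed; 'x' and 'w' become NaN
print(positional)   # [11 22 33]
```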

Example: Compute (1) the minutes played per appearance and (2) each player’s year of birth.

minutes = uswnt['Min']
apps = uswnt['Apps']

(minutes / apps)
0      90.000000
1      90.000000
2      84.500000
3      81.833333
         ...    
143    90.000000
144    45.000000
145    31.000000
146     8.000000
Length: 147, dtype: float64
year = uswnt['Year']
ages = uswnt['Age']

(year - ages)
0      1965
1      1971
2      1968
3      1965
       ... 
143    1993
144    1988
145    1988
146    1994
Length: 147, dtype: int64

Descriptive methods

As noted in the previous section, Series and DataFrame objects are built on Numpy arrays, with labels attached to the rows and columns. As such,

  • Numpy functions and methods are directly applicable to Pandas objects (particularly Series), and

  • many Pandas methods mirror Numpy functions, often with default arguments changed to be more convenient for data analysis.
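
As a small illustration of the first point, the sketch below applies a Numpy function to a hypothetical toy Series. Element-wise Numpy functions return a Series with the labels preserved, and to_numpy exposes the underlying array.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series with a non-default index.
s = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

roots = np.sqrt(s)        # element-wise: returns a Series, labels are kept
values = s.to_numpy()     # the underlying Numpy array, without labels

print(roots)
print(type(values))
```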

Example: Applying Numpy functions to a Series (e.g. a column of a DataFrame) results in applying the function to the data in the underlying Numpy array.

# mean age of the players
np.sum(ages) / ages.shape[0]
26.421768707482993
# The mean
np.mean(ages)
26.421768707482993
np.median(ages)
26.0

Example: Pandas supplies these Numpy functions as Series methods as well.

ages.mean()
26.421768707482993
ages.median()
26.0
ages.describe()
count    147.000000
mean      26.421769
std        4.298654
min       18.000000
25%       23.000000
50%       26.000000
75%       30.000000
max       39.000000
Name: Age, dtype: float64

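Example: The variance and the standard deviation are examples of methods whose defaults differ between Numpy and Pandas.

  • In Numpy, np.var and np.std compute the population variance and standard deviation (dividing by n).

  • In Pandas, the var and std methods compute the sample variance and standard deviation (dividing by n - 1).

# population standard deviation, computed by hand (divide by n)
(((ages - ages.mean())**2).sum() / ages.shape[0])**(1/2)
4.284007866739518
np.std(ages)
4.284007866739518
# sample standard deviation, computed by hand (divide by n - 1)
(((ages - ages.mean())**2).sum() / (ages.shape[0] - 1))**(1/2)
4.298654090204659
ages.std()
4.298654090204659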
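
The difference comes down to the ddof ('delta degrees of freedom') argument, which both libraries accept: Numpy defaults to ddof=0 and Pandas to ddof=1. A minimal sketch on hypothetical toy values, showing that either library can reproduce the other's result:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: mean 5, squared deviations sum to 32.
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_var = np.var(x.to_numpy())                  # ddof=0: 32 / 8
sample_var = x.var()                            # ddof=1: 32 / 7

pop_var_pandas = x.var(ddof=0)                  # Pandas, population version
sample_var_numpy = np.var(x.to_numpy(), ddof=1) # Numpy, sample version

print(pop_var, sample_var)
```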

DataFrame Methods and the axis keyword

  • DataFrames share many of the same methods with Series.

    • The DataFrame method applies the corresponding Series method to every row or column.

  • Some of these methods take the axis keyword argument:

    • axis=0: the method is applied to each column, that is, to Series indexed by the row labels.

    • axis=1: the method is applied to each row, that is, to Series indexed by the column labels.

  • Default value: axis=0 (apply method to each column).
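
The effect of the axis keyword is easiest to see on a tiny table. The sketch below uses a hypothetical toy DataFrame with numeric columns only; the row labels p1, p2, p3 are made up for illustration.

```python
import pandas as pd

# Hypothetical toy DataFrame with numeric columns only.
df = pd.DataFrame({'Gls': [1, 6, 0], 'Ast': [2, 3, 0]},
                  index=['p1', 'p2', 'p3'])

per_column = df.sum()          # axis=0 (default): one value per column
per_row = df.sum(axis=1)       # axis=1: one value per row

print(per_column)   # indexed by column name: Gls 7, Ast 5
print(per_row)      # indexed by row label: p1 3, p2 9, p3 0
```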

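# numeric_only=True skips the text columns; recent Pandas versions raise an error without it
uswnt.mean(numeric_only=True)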
Age         26.421769
Year      2005.612245
Apps         4.605442
Starts       3.741497
Min        342.789116
Gls          0.918367
Ast          0.476190
dtype: float64
uswnt.max()
Player    Wendy Gebauer
Pos              MFFWDF
Age                  39
Year               2019
              ...      
Starts                7
Min                 630
Gls                  10
Ast                   4
Length: 9, dtype: object
uswnt.head()
Player Pos Age Year Apps Starts Min Gls Ast
0 Mary Harvey GK 26 1991 6 6 540 0 0
1 Julie Foudy MF 20 1991 6 6 540 1 2
2 Carla Overbeck DF 23 1991 6 6 507 0 0
3 Carin Jennings-Gabarra FW 26 1991 6 6 491 6 3
4 Michelle Akers MFFW 25 1991 6 6 491 10 1
# row sums over the numeric columns only
uswnt.head().sum(axis=1, numeric_only=True)
0    2569
1    2566
2    2533
3    2529
4    2530
dtype: int64
uswnt.describe()
Age Year Apps Starts Min Gls Ast
count 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000 147.000000
mean 26.421769 2005.612245 4.605442 3.741497 342.789116 0.918367 0.476190
std 4.298654 9.229530 1.967398 2.427266 203.786157 1.559374 0.862724
min 18.000000 1991.000000 1.000000 0.000000 2.000000 0.000000 0.000000
25% 23.000000 1999.000000 3.000000 1.000000 113.500000 0.000000 0.000000
50% 26.000000 2007.000000 6.000000 5.000000 429.000000 0.000000 0.000000
75% 30.000000 2015.000000 6.000000 6.000000 509.500000 1.000000 1.000000
max 39.000000 2019.000000 7.000000 7.000000 630.000000 10.000000 4.000000

The apply method

The apply method is both a Series and a DataFrame method for applying custom functions across data.

  • ser.apply(func) applies func to the values contained in the Series ser,

  • df.apply(func) applies func to the columns of the DataFrame df,

  • df.apply(func, axis=1) applies func to the rows of the DataFrame df.

Remark: Notice that, when applied to a DataFrame, func should be a function that takes in a Series.
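
The three calling patterns can be compared side by side. In the sketch below, the toy DataFrame and the helper value_range are hypothetical and chosen only for illustration; note that the same function receives a whole column, a whole row, or a single value depending on how apply is called.

```python
import pandas as pd

# Hypothetical toy DataFrame with numeric columns only.
df = pd.DataFrame({'Gls': [1, 6, 0], 'Ast': [2, 3, 0]})

def value_range(ser):
    '''Difference between the largest and smallest value of a Series.'''
    return ser.max() - ser.min()

col_ranges = df.apply(value_range)            # func receives each column
row_ranges = df.apply(value_range, axis=1)    # func receives each row
doubled = df['Gls'].apply(lambda g: 2 * g)    # func receives each value

print(col_ranges)   # Gls 6, Ast 3
print(row_ranges)   # 1, 3, 0
```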

Example: To create a boolean column that describes if a given player’s first name ends in the letter e, create a custom function to pass to apply:

def firstname_endswith_e(player):
    '''returns True if the first name ends in the letter e'''
    fn = player.split(maxsplit=1)[0]  # first word; also works for single-word names
    return fn[-1] == 'e'
uswnt['Player'].apply(firstname_endswith_e)
0      False
1       True
2      False
3      False
       ...  
143    False
144    False
145     True
146    False
Name: Player, Length: 147, dtype: bool

The agg method

The agg method simultaneously applies multiple Series methods to the columns of a DataFrame. Given a DataFrame df,

  • df.agg(func) returns a Series obtained by applying the function to the columns of df,

  • df.agg([f1,...,fN]) returns a DataFrame obtained by applying each function to each column of df,

  • df.agg({col1:f1,...,colN:fN}) returns a Series obtained by applying each function to the column specified by its corresponding key.

  • Analogously, agg can also be passed a dictionary, keyed by column name, of lists of functions.

Remark 1: agg accepts function/method names as well, represented as strings.

Remark 2: agg accepts an axis keyword argument; passing axis=1 applies the functions row-wise instead of column-wise.

Example: uswnt.agg('max') computes the maximum value for each column. This value is also computable using the method directly – that is, uswnt.max().

uswnt.agg('max')
Player    Wendy Gebauer
Pos              MFFWDF
Age                  39
Year               2019
              ...      
Starts                7
Min                 630
Gls                  10
Ast                   4
Length: 9, dtype: object

Example: Passing a list of functions into agg results in a DataFrame whose rows contain the results of applying each function to the columns of the original DataFrame. In the Pandas version used for these outputs, if a function throws an exception upon application to a column, the value in the resulting DataFrame is NaN; newer versions may raise the exception instead.

uswnt.agg(['mean', np.median, 'max'])
Player Pos Age Year Apps Starts Min Gls Ast
max Wendy Gebauer MFFWDF 39.000000 2019.000000 7.000000 7.000000 630.000000 10.000000 4.00000
mean NaN NaN 26.421769 2005.612245 4.605442 3.741497 342.789116 0.918367 0.47619
median NaN NaN 26.000000 2007.000000 6.000000 5.000000 429.000000 0.000000 0.00000

Example: Similarly, passing a dictionary of functions keyed by column name applies the function only to the specified columns.

uswnt.agg({'Player': 'max', 'Pos': 'min', 'Age': 'mean', 'Ast': 'min'})
Player    Wendy Gebauer
Pos                   0
Age             26.4218
Ast                   0
dtype: object
uswnt.agg({'Player': ['min', 'max'], 'Age': ['mean', np.median, 'max']})
Player Age
max Wendy Gebauer 39.000000
mean NaN 26.421769
median NaN 26.000000
min Abby Dahlkemper NaN
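
The calling patterns of agg can also be checked on a small table where every function applies to every column, so no NaN entries or exceptions arise. The sketch below uses a hypothetical toy DataFrame with numeric columns only.

```python
import pandas as pd

# Hypothetical toy DataFrame with numeric columns only.
df = pd.DataFrame({'Age': [26, 20, 23], 'Gls': [0, 1, 10]})

one_func = df.agg('max')                             # Series: one value per column
many_funcs = df.agg(['min', 'max'])                  # DataFrame: one row per function
per_column = df.agg({'Age': 'mean', 'Gls': 'max'})   # Series: one value per key
by_row = df.agg('sum', axis=1)                       # Series: one value per row

print(many_funcs)
```

Here many_funcs has the function names as its index and the original column names as its columns, matching the layout of the outputs above.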