Methods and Descriptive Statistics

Informational Methods

Pandas provides informational methods that help the user ‘peek at the data’ in different ways (e.g. look at a few rows at a time, count the number of non-null entries, or count the number of distinct entries). These methods are particularly useful when the data is too large to look at in its entirety.

Method Name	Description
`head`	Return the first `n` entries of a Series (default `n=5`)
`tail`	Return the last `n` entries of a Series (default `n=5`)
`count`	Return the number of non-null entries of a Series
`nunique`	Return the number of unique values of a Series
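
As a quick illustration of the four methods in the table, here is a minimal sketch on a small made-up Series (the name `scores` and its values are hypothetical, not part of the dataset used below). It assumes the default behavior in which `count` and `nunique` both ignore missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and one repeated value (3.0).
scores = pd.Series([3.0, np.nan, 5.0, 3.0, 8.0, 1.0])

first_two = scores.head(2)     # first 2 entries (the missing value is included)
last_two = scores.tail(2)      # last 2 entries
n_present = scores.count()     # number of non-null entries
n_distinct = scores.nunique()  # number of distinct non-null values

print(len(scores), n_present, n_distinct)  # prints: 6 5 4
```

Comparing `len`, `count`, and `nunique` like this is a cheap way to spot missing values and repeated values before doing any further analysis.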

Example: The DataFrame named uswnt contains information on the soccer players who represented the US Women’s National Team at the World Cup tournaments from 1991 through 2019. Each row describes one player at one tournament.

import numpy as np
import pandas as pd

uswnt = pd.read_csv('data/world_cups.csv')

# number of rows / columns
uswnt.shape

(147, 9)

# first 7 entries; players from the 90s
uswnt.head(7)

	Player	Pos	Age	Year	Apps	Starts	Min	Gls	Ast
0	Mary Harvey	GK	26	1991	6	6	540	0	0
1	Julie Foudy	MF	20	1991	6	6	540	1	2
2	Carla Overbeck	DF	23	1991	6	6	507	0	0
3	Carin Jennings-Gabarra	FW	26	1991	6	6	491	6	3
4	Michelle Akers	MFFW	25	1991	6	6	491	10	1
5	Linda Hamilton	DF	22	1991	6	5	507	0	0
6	Mia Hamm	MFFW	19	1991	6	5	499	2	1

# last 2 entries; players from 2019
uswnt.tail(2)

	Player	Pos	Age	Year	Apps	Starts	Min	Gls	Ast
145	Allie Long	MF	31	2019	1	0	31	0	0
146	Emily Sonnett	DF	25	2019	1	0	8	0	0

A look at the Player column:

# Look at the Player column; `head` is also a Series method.
players = uswnt['Player']
players.head()

0               Mary Harvey
1               Julie Foudy
2            Carla Overbeck
3    Carin Jennings-Gabarra
4            Michelle Akers
Name: Player, dtype: object

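# total number of entries in the column
players.shape

(147,)

# number of non-null entries
players.count()

# number of distinct players; this is smaller than the number of entries
# whenever a player appears in more than one tournament (e.g. Abby Wambach)
players.nunique()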

# Top 5: Most goals in a single world-cup tournament; note the index.
uswnt.sort_values(by='Gls', ascending=False).head()

	Player	Pos	Age	Year	Apps	Starts	Min	Gls	Ast
4	Michelle Akers	MFFW	25	1991	6	6	491	10	1
132	Alex Morgan	FW	29	2019	6	6	490	6	3
3	Carin Jennings-Gabarra	FW	26	1991	6	6	491	6	3
110	Carli Lloyd	MF	32	2015	7	7	630	6	1
136	Megan Rapinoe	FWMF	33	2019	5	5	429	6	2

# Top goal scorer per world cup for USWNT
(
    uswnt
    .sort_values(by='Gls', ascending=False)
    .drop_duplicates(subset=['Year'])
    .sort_values(by='Year')
)

	Player	Pos	Age	Year	Apps	Starts	Min	Gls	Ast
4	Michelle Akers	MFFW	25	1991	6	6	491	10	1
19	Kristine Lilly	MFFW	23	1995	6	6	518	3	0
41	Tiffeny Milbrett	FW	26	1999	6	5	509	3	0
60	Abby Wambach	FW	23	2003	6	5	426	3	0
72	Abby Wambach	FW	27	2007	6	6	536	6	0
89	Abby Wambach	FW	31	2011	6	6	600	4	1
110	Carli Lloyd	MF	32	2015	7	7	630	6	1
132	Alex Morgan	FW	29	2019	6	6	490	6	3
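
The sort, drop duplicates, sort pattern used above can be tried on a small made-up table. This is only a sketch: the column names `name`, `year`, and `goals` and all values are hypothetical. It relies on `drop_duplicates` keeping the first row it encounters for each value of the subset columns, which after a descending sort is the row with the most goals.

```python
import pandas as pd

# Hypothetical toy data, not taken from the uswnt table.
df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E'],
    'year': [2001, 2001, 2005, 2005, 2005],
    'goals': [2, 5, 1, 4, 3],
})

top_per_year = (
    df
    .sort_values(by='goals', ascending=False)  # highest scorers first
    .drop_duplicates(subset=['year'])          # keep the first (highest) row per year
    .sort_values(by='year')                    # back to chronological order
)

print(top_per_year)
```

Note that when two players are tied within a year, only one of them survives, and which one depends on the order produced by the sort.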

Array arithmetic

Series support element-wise array arithmetic, just like NumPy arrays.
Warning: the indices of the two Series are aligned before the operation is performed! (More on this later)
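
To make the warning concrete, the following sketch adds two small made-up Series (the names `a` and `b` and their labels are hypothetical) whose indices only partly overlap. Values are paired by label rather than by position, and labels that appear in only one Series produce missing values.

```python
import pandas as pd

# Hypothetical toy data: the labels overlap only partially and are ordered differently.
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'w'])

# Label-based arithmetic: 'y' pairs 2 with 20 and 'z' pairs 3 with 10;
# 'x' and 'w' each exist in only one Series, so their results are NaN.
total = a + b

# Position-based arithmetic on the underlying NumPy arrays ignores the labels.
positional = a.to_numpy() + b.to_numpy()

print(total)
print(positional)
```

In the uswnt examples that follow, both Series come from the same DataFrame and therefore share an index, so label alignment and positional alignment agree.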

Example: Compute (1) the minutes played per appearance and (2) each player’s approximate year of birth (tournament year minus age).

minutes = uswnt['Min']
apps = uswnt['Apps']

(minutes / apps)

0      90.000000
1      90.000000
2      84.500000
3      81.833333
         ...    
143    90.000000
144    45.000000
145    31.000000
146     8.000000
Length: 147, dtype: float64

year = uswnt['Year']
ages = uswnt['Age']

(year - ages)

0      1965
1      1971
2      1968
3      1965
       ... 
143    1993
144    1988
145    1988
146    1994
Length: 147, dtype: int64

Descriptive methods

As noted in the previous section, Series and DataFrame objects are built on top of NumPy arrays, with labels attached to the rows and columns. As such,

- NumPy functions and methods are directly applicable to Pandas objects (particularly Series), and
- many Pandas methods mirror NumPy functions, often with tweaks to default arguments that are convenient for data analysis.
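
As a small sketch of the first point, the snippet below applies NumPy functions to a made-up Series (the name `s` and its values are hypothetical). Element-wise functions such as `np.sqrt` return a Series that keeps the original labels, while reductions such as `np.sum` return a single number.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with string labels.
s = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

roots = np.sqrt(s)       # element-wise: a Series with the same index as s
total = np.sum(s)        # reduction: a single number
raw = s.to_numpy()       # the underlying NumPy array, without labels

print(roots)
print(total)
```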

Example: Applying Numpy functions to a Series (e.g. a column of a DataFrame) results in applying the function to the data in the underlying Numpy array.

# mean age of the players
np.sum(ages) / ages.shape[0]

26.421768707482993

# The mean
np.mean(ages)

26.421768707482993

np.median(ages)

26.0

Example: Pandas supplies these NumPy functions as Series methods as well.

ages.mean()

26.421768707482993

ages.median()

26.0

ages.describe()

count    147.000000
mean      26.421769
std        4.298654
min       18.000000
25%       23.000000
50%       26.000000
75%       30.000000
max       39.000000
Name: Age, dtype: float64

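Example: The variance and the standard deviation are examples of methods whose defaults differ between NumPy and Pandas.

- In NumPy, np.var and np.std compute the population variance and standard deviation (dividing by n).
- In Pandas, the var and std methods compute the sample variance and standard deviation (dividing by n - 1).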

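# population standard deviation by hand: divide by n
(((ages - ages.mean())**2).sum() / ages.shape[0])**(1/2)

4.284007866739518

# NumPy agrees with the population formula
np.std(ages)

4.284007866739518

# sample standard deviation by hand: divide by n - 1
(((ages - ages.mean())**2).sum() / (ages.shape[0] - 1))**(1/2)

4.298654090204659

# the Pandas method agrees with the sample formula
ages.std()

4.298654090204659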
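
The difference comes down to the `ddof` ("delta degrees of freedom") argument, which both libraries accept: the divisor is `n - ddof`. The sketch below uses a made-up Series (the name `x` and its values are hypothetical) to show that either convention can be requested from either library.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with mean 5 and squared deviations summing to 32.
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_var = np.var(x.to_numpy())               # NumPy default ddof=0: 32 / 8
sample_var = x.var()                         # Pandas default ddof=1: 32 / 7

pop_var_pandas = x.var(ddof=0)               # population variance via Pandas
sample_var_numpy = np.var(x.to_numpy(), ddof=1)  # sample variance via NumPy

print(pop_var, sample_var)
```

Stating `ddof` explicitly is a reasonable habit whenever results from the two libraries are compared.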

`DataFrame` Methods and the `axis` keyword

DataFrames share many of the same methods with Series.
- The DataFrame method applies the corresponding Series method to every column (or to every row).
Some of these methods take the axis keyword argument:
- axis=0: the method is applied to each column, i.e. to a Series indexed by the row labels.
- axis=1: the method is applied to each row, i.e. to a Series indexed by the column labels.
Default value: axis=0 (apply the method to each column).
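
The two settings of the axis keyword can be compared on a small made-up DataFrame. This is a minimal sketch; the column names `a` and `b`, the row labels, and the values are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data with labeled rows.
df = pd.DataFrame(
    {'a': [1, 2, 3], 'b': [10, 20, 30]},
    index=['r1', 'r2', 'r3'],
)

col_sums = df.sum()         # axis=0 (default): one value per column
row_sums = df.sum(axis=1)   # axis=1: one value per row

print(col_sums)
print(row_sums)
```

The result of the default call is indexed by the column names, while the `axis=1` result is indexed by the row labels.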

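# restrict to the numeric columns; recent versions of Pandas raise an
# error when asked for the mean of a column of strings
uswnt.mean(numeric_only=True)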

Age         26.421769
Year      2005.612245
Apps         4.605442
Starts       3.741497
Min        342.789116
Gls          0.918367
Ast          0.476190
dtype: float64

uswnt.max()

Player    Wendy Gebauer
Pos              MFFWDF
Age                  39
Year               2019
              ...      
Starts                7
Min                 630
Gls                  10
Ast                   4
Length: 9, dtype: object

uswnt.head()

	Player	Pos	Age	Year	Apps	Starts	Min	Gls	Ast
0	Mary Harvey	GK	26	1991	6	6	540	0	0
1	Julie Foudy	MF	20	1991	6	6	540	1	2
2	Carla Overbeck	DF	23	1991	6	6	507	0	0
3	Carin Jennings-Gabarra	FW	26	1991	6	6	491	6	3
4	Michelle Akers	MFFW	25	1991	6	6	491	10	1

# row-wise sum over the numeric columns only
uswnt.head().sum(axis=1, numeric_only=True)

0    2569
1    2566
2    2533
3    2529
4    2530
dtype: int64

uswnt.describe()

	Age	Year	Apps	Starts	Min	Gls	Ast
count	147.000000	147.000000	147.000000	147.000000	147.000000	147.000000	147.000000
mean	26.421769	2005.612245	4.605442	3.741497	342.789116	0.918367	0.476190
std	4.298654	9.229530	1.967398	2.427266	203.786157	1.559374	0.862724
min	18.000000	1991.000000	1.000000	0.000000	2.000000	0.000000	0.000000
25%	23.000000	1999.000000	3.000000	1.000000	113.500000	0.000000	0.000000
50%	26.000000	2007.000000	6.000000	5.000000	429.000000	0.000000	0.000000
75%	30.000000	2015.000000	6.000000	6.000000	509.500000	1.000000	1.000000
max	39.000000	2019.000000	7.000000	7.000000	630.000000	10.000000	4.000000

The `apply` method

The apply method is both a Series and a DataFrame method for applying custom functions across data.

- ser.apply(func) applies func to each value contained in the Series ser,
- df.apply(func) applies func to each column of the DataFrame df,
- df.apply(func, axis=1) applies func to each row of the DataFrame df.

Remark: Notice that, when applied to a DataFrame, func should be a function that takes in a Series.
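
The three call patterns can be compared on a small made-up DataFrame. This is a sketch under stated assumptions: the helper `value_range` and the data are hypothetical and chosen only to show what `func` receives in each case.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

def value_range(ser):
    '''returns the difference between the largest and smallest value of a Series'''
    return ser.max() - ser.min()

squared = df['a'].apply(lambda v: v ** 2)    # func receives one value at a time
col_ranges = df.apply(value_range)           # func receives each column as a Series
row_ranges = df.apply(value_range, axis=1)   # func receives each row as a Series

print(squared.tolist(), col_ranges.tolist(), row_ranges.tolist())
```

Passing `value_range` to `df['a'].apply` would fail, because a single number has no `max` method; this is the distinction the remark above points out.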

Example: To create a boolean column that describes if a given player’s first name ends in the letter e, create a custom function to pass to apply:

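def firstname_endswith_e(player):
    '''returns True if the first name ends in the letter e'''
    # the first whitespace-separated token is the first name; indexing
    # (rather than unpacking) also works for single-word names
    first_name = player.split(maxsplit=1)[0]
    return first_name.endswith('e')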

uswnt['Player'].apply(firstname_endswith_e)

0      False
1       True
2      False
3      False
       ...  
143    False
144    False
145     True
146    False
Name: Player, Length: 147, dtype: bool

The `agg` method

The agg method applies one or more Series methods to the columns of a DataFrame in a single call. Given a DataFrame df,

- df.agg(func) returns a Series obtained by applying func to each column of df,
- df.agg([f1,...,fN]) returns a DataFrame obtained by applying each function to each column of df,
- df.agg({col1:f1,...,colN:fN}) returns a Series obtained by applying each function to the column specified by its corresponding key.

Analogously, agg can also be passed a dictionary, keyed by column name, of lists of functions; the result is then a DataFrame.

Remark 1: agg accepts function/method names as well, represented as strings.

Remark 2: agg has an axis keyword argument; passing axis=1 applies the functions row-wise instead of column-wise.
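
The call patterns listed above can be tried on a small, purely numeric, made-up DataFrame before applying them to uswnt. This is a minimal sketch; the column names `a` and `b` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with numeric columns only.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

single = df.agg('max')                        # Series: one value per column
several = df.agg(['min', 'max'])              # DataFrame: one row per function
per_col = df.agg({'a': 'sum', 'b': 'mean'})   # Series: one function per column
nested = df.agg({'a': ['min', 'max'], 'b': ['mean']})  # DataFrame; cells not requested are NaN
row_max = df.agg('max', axis=1)               # Series: one value per row

print(several)
print(nested)
```

In the nested case, each row of the result corresponds to a function name and each column to a key of the dictionary, so a cell is filled only where that function was requested for that column.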

Example: uswnt.agg('max') computes the maximum value for each column. This value is also computable using the method directly – that is, uswnt.max().

uswnt.agg('max')

Player    Wendy Gebauer
Pos              MFFWDF
Age                  39
Year               2019
              ...      
Starts                7
Min                 630
Gls                  10
Ast                   4
Length: 9, dtype: object

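Example: Passing a list of functions into agg results in a DataFrame whose rows contain the results of applying each function to the columns of the original DataFrame. In the Pandas version used to produce the output below, if a function throws an exception upon application to a column (e.g. the mean of a column of strings), the value in the resulting DataFrame is NaN; recent versions of Pandas raise the exception instead.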

uswnt.agg(['mean', np.median, 'max'])

	Player	Pos	Age	Year	Apps	Starts	Min	Gls	Ast
max	Wendy Gebauer	MFFWDF	39.000000	2019.000000	7.000000	7.000000	630.000000	10.000000	4.00000
mean	NaN	NaN	26.421769	2005.612245	4.605442	3.741497	342.789116	0.918367	0.47619
median	NaN	NaN	26.000000	2007.000000	6.000000	5.000000	429.000000	0.000000	0.00000

Example: Similarly, passing a dictionary of functions keyed by column name applies the function only to the specified columns.

uswnt.agg({'Player': 'max', 'Pos': 'min', 'Age': 'mean', 'Ast': 'min'})

Player    Wendy Gebauer
Pos                   0
Age             26.4218
Ast                   0
dtype: object

uswnt.agg({'Player': ['min', 'max'], 'Age': ['mean', np.median, 'max']})

	Player	Age
max	Wendy Gebauer	39.000000
mean	NaN	26.421769
median	NaN	26.000000
min	Abby Dahlkemper	NaN
