Methods and Descriptive Statistics
Informational Methods
Pandas provides methods that help the user 'peek at the data' in different ways (e.g. look at a few rows at a time, count the number of non-null entries, count the number of distinct entries). These methods are particularly useful when the data is too large to look at in its entirety.
Method Name | Description |
---|---|
`head(n)` | Return the first `n` rows |
`tail(n)` | Return the last `n` rows |
`count` | Count the number of non-null entries of a Series |
`nunique` | Return the number of unique values of a Series |
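The difference between the number of entries, the non-null count, and the number of distinct values is easiest to see on a small Series. The sketch below uses a made-up Series `positions` (not part of the `uswnt` data) that contains one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: five entries, one of them missing
positions = pd.Series(['GK', 'MF', np.nan, 'MF', 'DF'])

n_entries = positions.shape[0]    # all entries, including the missing one
n_non_null = positions.count()    # missing values are not counted
n_distinct = positions.nunique()  # distinct non-null values: GK, MF, DF

first_two = positions.head(2)     # first two entries
last_two = positions.tail(2)      # last two entries

print(n_entries, n_non_null, n_distinct)
```

Comparing `shape[0]` with `count()` is a quick way to check whether a column contains missing values.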
Example: The DataFrame named `uswnt` contains information on all soccer players on the US Women's national team from 1991 through 2019.
import numpy as np
import pandas as pd

uswnt = pd.read_csv('data/world_cups.csv')
# number of rows / columns
uswnt.shape
(147, 9)
# first 7 entries; players from the 90s
uswnt.head(7)
 | Player | Pos | Age | Year | Apps | Starts | Min | Gls | Ast |
---|---|---|---|---|---|---|---|---|---|
0 | Mary Harvey | GK | 26 | 1991 | 6 | 6 | 540 | 0 | 0 |
1 | Julie Foudy | MF | 20 | 1991 | 6 | 6 | 540 | 1 | 2 |
2 | Carla Overbeck | DF | 23 | 1991 | 6 | 6 | 507 | 0 | 0 |
3 | Carin Jennings-Gabarra | FW | 26 | 1991 | 6 | 6 | 491 | 6 | 3 |
4 | Michelle Akers | MFFW | 25 | 1991 | 6 | 6 | 491 | 10 | 1 |
5 | Linda Hamilton | DF | 22 | 1991 | 6 | 5 | 507 | 0 | 0 |
6 | Mia Hamm | MFFW | 19 | 1991 | 6 | 5 | 499 | 2 | 1 |
# last 2 entries; players from 2019
uswnt.tail(2)
 | Player | Pos | Age | Year | Apps | Starts | Min | Gls | Ast |
---|---|---|---|---|---|---|---|---|---|
145 | Allie Long | MF | 31 | 2019 | 1 | 0 | 31 | 0 | 0 |
146 | Emily Sonnett | DF | 25 | 2019 | 1 | 0 | 8 | 0 | 0 |
A look at the Player column:
# Look at the Player column; `head` is also a Series method.
players = uswnt['Player']
players.head()
0 Mary Harvey
1 Julie Foudy
2 Carla Overbeck
3 Carin Jennings-Gabarra
4 Michelle Akers
Name: Player, dtype: object
# 147 entries, none of them null, but only 76 distinct names:
# the same player can appear in several World Cups
players.shape
(147,)
players.count()
147
players.nunique()
76
# Top 5: Most goals in a single world-cup tournament; note the index.
uswnt.sort_values(by='Gls', ascending=False).head()
 | Player | Pos | Age | Year | Apps | Starts | Min | Gls | Ast |
---|---|---|---|---|---|---|---|---|---|
4 | Michelle Akers | MFFW | 25 | 1991 | 6 | 6 | 491 | 10 | 1 |
132 | Alex Morgan | FW | 29 | 2019 | 6 | 6 | 490 | 6 | 3 |
3 | Carin Jennings-Gabarra | FW | 26 | 1991 | 6 | 6 | 491 | 6 | 3 |
110 | Carli Lloyd | MF | 32 | 2015 | 7 | 7 | 630 | 6 | 1 |
136 | Megan Rapinoe | FWMF | 33 | 2019 | 5 | 5 | 429 | 6 | 2 |
# Top goal scorer per world cup for USWNT
(
uswnt
.sort_values(by='Gls', ascending=False)
.drop_duplicates(subset=['Year'])
.sort_values(by='Year')
)
 | Player | Pos | Age | Year | Apps | Starts | Min | Gls | Ast |
---|---|---|---|---|---|---|---|---|---|
4 | Michelle Akers | MFFW | 25 | 1991 | 6 | 6 | 491 | 10 | 1 |
19 | Kristine Lilly | MFFW | 23 | 1995 | 6 | 6 | 518 | 3 | 0 |
41 | Tiffeny Milbrett | FW | 26 | 1999 | 6 | 5 | 509 | 3 | 0 |
60 | Abby Wambach | FW | 23 | 2003 | 6 | 5 | 426 | 3 | 0 |
72 | Abby Wambach | FW | 27 | 2007 | 6 | 6 | 536 | 6 | 0 |
89 | Abby Wambach | FW | 31 | 2011 | 6 | 6 | 600 | 4 | 1 |
110 | Carli Lloyd | MF | 32 | 2015 | 7 | 7 | 630 | 6 | 1 |
132 | Alex Morgan | FW | 29 | 2019 | 6 | 6 | 490 | 6 | 3 |
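The chain above relies on `drop_duplicates` keeping the first row it encounters for each `Year`; after sorting by goals in descending order, that first row is the top scorer for the year. A minimal sketch on a made-up four-row table (the player names and numbers are hypothetical):

```python
import pandas as pd

# Hypothetical mini-table: two tournaments, two players each
toy = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D'],
    'Year': [1991, 1991, 1995, 1995],
    'Gls': [1, 10, 3, 2],
})

top_per_year = (
    toy
    .sort_values(by='Gls', ascending=False)  # highest scorers first
    .drop_duplicates(subset=['Year'])        # keep the first row seen per Year
    .sort_values(by='Year')                  # restore chronological order
)

print(top_per_year)
```

Note that the original row labels survive the chain, which is why the index of the result is not consecutive. When two players tie for the most goals in a year, only the one that happens to come first after sorting is kept.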
Array arithmetic
`Series` support array arithmetic just like NumPy arrays.
Warning: Series are lined up by index label before the operation is applied! (More on this later.)
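To make the warning concrete, here is a small sketch of index alignment. The Series `a`, `b`, and `c` are made up for illustration and are not part of the `uswnt` data.

```python
import pandas as pd

# Same labels, stored in a different order
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'x'])

# Values are matched by label, not by position:
# x -> 1 + 30, y -> 2 + 20, z -> 3 + 10
total = a + b

# A label that appears in only one operand has no partner,
# so the result for that label is NaN
c = pd.Series([100], index=['w'])
unmatched = a + c

print(total)
print(unmatched)
```

In the `uswnt` examples that follow, both operands come from the same DataFrame and therefore share an index, so label alignment and positional alignment agree.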
Example: Compute (1) the minutes played per appearance and (2) each player's approximate year of birth.
minutes = uswnt['Min']
apps = uswnt['Apps']
(minutes / apps)
0 90.000000
1 90.000000
2 84.500000
3 81.833333
...
143 90.000000
144 45.000000
145 31.000000
146 8.000000
Length: 147, dtype: float64
year = uswnt['Year']
ages = uswnt['Age']
(year - ages)
0 1965
1 1971
2 1968
3 1965
...
143 1993
144 1988
145 1988
146 1994
Length: 147, dtype: int64
Descriptive methods
As noted in the previous section, `Series` and `DataFrame` objects are NumPy arrays with named labels. As such, NumPy functions and methods are directly applicable to Pandas objects (particularly `Series`), and many Pandas methods mirror their NumPy counterparts, often with tweaks to default arguments that are convenient for data analysis.
Example: Applying NumPy functions to a `Series` (e.g. a column of a `DataFrame`) results in applying the function to the data in the underlying NumPy array.
# mean age of the players
np.sum(ages) / ages.shape[0]
26.421768707482993
# The mean
np.mean(ages)
26.421768707482993
np.median(ages)
26.0
Example: Pandas supplies these NumPy functions as `Series` methods as well.
ages.mean()
26.421768707482993
ages.median()
26.0
ages.describe()
count 147.000000
mean 26.421769
std 4.298654
min 18.000000
25% 23.000000
50% 26.000000
75% 30.000000
max 39.000000
Name: Age, dtype: float64
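Example: The variance is an example of a method that differs between NumPy and Pandas.
In NumPy, `np.var` computes the population variance (dividing by n).
In Pandas, the `var` method computes the sample variance (dividing by n - 1).
The same difference applies to the standard deviation, which is what the code below computes.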
# population standard deviation by hand (divide by n); matches np.std
(((ages - ages.mean())**2).sum() / ages.shape[0])**(1/2)
4.284007866739518
np.std(ages)
4.284007866739518
# sample standard deviation by hand (divide by n - 1); matches the Pandas std method
(((ages - ages.mean())**2).sum() / (ages.shape[0] - 1))**(1/2)
4.298654090204659
ages.std()
4.298654090204659
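Both libraries expose the divisor through the `ddof` ("delta degrees of freedom") keyword, so either convention can be requested explicitly. The sketch below uses a made-up Series `sample_ages` rather than the `uswnt` data.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; mean is 24, squared deviations sum to 26
sample_ages = pd.Series([26, 20, 23, 26, 25])
values = sample_ages.to_numpy()

pop_var = np.var(values)         # NumPy default: ddof=0, divide by n
sample_var = sample_ages.var()   # Pandas default: ddof=1, divide by n - 1

# Passing ddof explicitly makes the two libraries agree
np_as_sample = np.var(values, ddof=1)
pd_as_population = sample_ages.var(ddof=0)

print(pop_var, sample_var)
```

Stating `ddof` explicitly is a cheap way to avoid surprises when code mixes NumPy functions and Pandas methods.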
`DataFrame` Methods and the `axis` keyword
DataFrames share many methods with Series. The DataFrame method applies the corresponding Series method to every row or column.
Some of these methods take the `axis` keyword argument:
`axis=0`: the method is applied to each column, i.e. to Series whose index is given by the rows.
`axis=1`: the method is applied to each row, i.e. to Series whose index is given by the columns.
Default value: `axis=0` (apply the method to each column).
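A minimal sketch of the two settings, using a made-up three-row DataFrame `toy` (not the `uswnt` data):

```python
import pandas as pd

# Hypothetical data: three rows labelled a, b, c
toy = pd.DataFrame(
    {'Apps': [6, 6, 1], 'Gls': [0, 10, 0]},
    index=['a', 'b', 'c'],
)

# axis=0 (the default): one result per column
col_sums = toy.sum()

# axis=1: one result per row
row_sums = toy.sum(axis=1)

print(col_sums)
print(row_sums)
```

The result of the default call is indexed by the column names, while the result of the `axis=1` call is indexed by the row labels.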
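# numeric_only=True restricts the mean to the numeric columns;
# recent Pandas versions raise an error on the text columns otherwise
uswnt.mean(numeric_only=True)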
Age 26.421769
Year 2005.612245
Apps 4.605442
Starts 3.741497
Min 342.789116
Gls 0.918367
Ast 0.476190
dtype: float64
uswnt.max()
Player Wendy Gebauer
Pos MFFWDF
Age 39
Year 2019
...
Starts 7
Min 630
Gls 10
Ast 4
Length: 9, dtype: object
uswnt.head()
 | Player | Pos | Age | Year | Apps | Starts | Min | Gls | Ast |
---|---|---|---|---|---|---|---|---|---|
0 | Mary Harvey | GK | 26 | 1991 | 6 | 6 | 540 | 0 | 0 |
1 | Julie Foudy | MF | 20 | 1991 | 6 | 6 | 540 | 1 | 2 |
2 | Carla Overbeck | DF | 23 | 1991 | 6 | 6 | 507 | 0 | 0 |
3 | Carin Jennings-Gabarra | FW | 26 | 1991 | 6 | 6 | 491 | 6 | 3 |
4 | Michelle Akers | MFFW | 25 | 1991 | 6 | 6 | 491 | 10 | 1 |
# row-wise sum of the numeric columns only
uswnt.head().sum(axis=1, numeric_only=True)
0 2569
1 2566
2 2533
3 2529
4 2530
dtype: int64
uswnt.describe()
 | Age | Year | Apps | Starts | Min | Gls | Ast |
---|---|---|---|---|---|---|---|
count | 147.000000 | 147.000000 | 147.000000 | 147.000000 | 147.000000 | 147.000000 | 147.000000 |
mean | 26.421769 | 2005.612245 | 4.605442 | 3.741497 | 342.789116 | 0.918367 | 0.476190 |
std | 4.298654 | 9.229530 | 1.967398 | 2.427266 | 203.786157 | 1.559374 | 0.862724 |
min | 18.000000 | 1991.000000 | 1.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 |
25% | 23.000000 | 1999.000000 | 3.000000 | 1.000000 | 113.500000 | 0.000000 | 0.000000 |
50% | 26.000000 | 2007.000000 | 6.000000 | 5.000000 | 429.000000 | 0.000000 | 0.000000 |
75% | 30.000000 | 2015.000000 | 6.000000 | 6.000000 | 509.500000 | 1.000000 | 1.000000 |
max | 39.000000 | 2019.000000 | 7.000000 | 7.000000 | 630.000000 | 10.000000 | 4.000000 |
The `apply` method
The `apply` method is both a Series and a DataFrame method for applying custom functions across data.
`ser.apply(func)` applies `func` to the values contained in the Series `ser`,
`df.apply(func)` applies `func` to the columns of the DataFrame `df`,
`df.apply(func, axis=1)` applies `func` to the rows of the DataFrame `df`.
Remark: Notice that, when applied to a DataFrame, `func` should be a function that takes in a Series.
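The DataFrame case can be sketched as follows. The DataFrame `toy` and the helper `value_range` are hypothetical names introduced for illustration; in both calls the function receives a Series, either one column or one row.

```python
import pandas as pd

# Hypothetical data: minutes and appearances for three players
toy = pd.DataFrame({'Min': [540, 491, 31], 'Apps': [6, 6, 1]})

def value_range(ser):
    '''returns the spread (max minus min) of a Series'''
    return ser.max() - ser.min()

# Column-wise (default): value_range receives each column as a Series
ranges = toy.apply(value_range)

# Row-wise: the function receives each row as a Series indexed by column name
minutes_per_app = toy.apply(lambda row: row['Min'] / row['Apps'], axis=1)

print(ranges)
print(minutes_per_app)
```

Row-wise `apply` calls the function once per row in Python, so for simple arithmetic like this the array expression `toy['Min'] / toy['Apps']` is usually the better choice.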
Example: To create a boolean column that describes whether a given player's first name ends in the letter `e`, create a custom function to pass to `apply`:
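def firstname_endswith_e(player):
    '''returns True if the first name ends in the letter e'''
    # the first name is the text before the first whitespace;
    # indexing with [0] also works for single-word names
    fn = player.split(maxsplit=1)[0]
    return fn[-1] == 'e'
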
uswnt['Player'].apply(firstname_endswith_e)
0 False
1 True
2 False
3 False
...
143 False
144 False
145 True
146 False
Name: Player, Length: 147, dtype: bool
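For simple string tests like this one, the `.str` accessor offers a vectorized alternative to `apply`. The sketch below assumes every name contains at least one non-whitespace character, and it uses three names from the table above as a stand-in for the full `Player` column.

```python
import pandas as pd

# Three names taken from the uswnt table, as a small stand-in column
names = pd.Series(['Mary Harvey', 'Julie Foudy', 'Allie Long'])

# Split on whitespace, take the first piece, test its last letter
first_names = names.str.split().str[0]
ends_with_e = first_names.str.endswith('e')

print(ends_with_e)
```

A custom function passed to `apply` remains the more flexible option when the logic does not map onto the built-in string methods.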
The `agg` method
The `agg` method simultaneously applies multiple Series methods to the columns of a DataFrame. Given a DataFrame `df`,
`df.agg(func)` returns a Series obtained by applying the function to the columns of `df`,
`df.agg([f1,...,fN])` returns a DataFrame obtained by applying each function to each column of `df`,
`df.agg({col1:f1,...,colN:fN})` returns a Series obtained by applying each function to the column specified by its corresponding key.
Analogously, `agg` can also be passed a dictionary, keyed by column name, of lists of functions.
Remark 1: `agg` accepts function/method names as well, represented as strings.
Remark 2: `agg` has an `axis` keyword argument that applies functions row-wise instead of column-wise.
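A short sketch of Remark 2, using a made-up numeric DataFrame `toy` (not the `uswnt` data):

```python
import pandas as pd

# Hypothetical data: goals and assists for three players
toy = pd.DataFrame({'Gls': [0, 10, 2], 'Ast': [0, 1, 1]})

# Default (axis=0): one maximum per column
col_max = toy.agg('max')

# axis=1: one maximum per row
row_max = toy.agg('max', axis=1)

# A list of functions with axis=1 gives one row per original row
# and one column per function
row_stats = toy.agg(['min', 'max'], axis=1)

print(col_max)
print(row_max)
print(row_stats)
```

Row-wise aggregation only makes sense when the columns involved hold comparable values, which is why the sketch sticks to numeric columns.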
Example: `uswnt.agg('max')` computes the maximum value for each column. This value is also computable using the method directly, that is, `uswnt.max()`.
uswnt.agg('max')
Player Wendy Gebauer
Pos MFFWDF
Age 39
Year 2019
...
Starts 7
Min 630
Gls 10
Ast 4
Length: 9, dtype: object
Example: Passing a list of functions into `agg` results in a DataFrame whose rows contain the results of applying each function to the columns of the original DataFrame. If a function throws an exception upon application to a column, the value in the resulting DataFrame is `NaN`. (This is the behavior of the Pandas version that produced the output below; newer releases may raise the exception instead.)
uswnt.agg(['mean', np.median, 'max'])
 | Player | Pos | Age | Year | Apps | Starts | Min | Gls | Ast |
---|---|---|---|---|---|---|---|---|---|
max | Wendy Gebauer | MFFWDF | 39.000000 | 2019.000000 | 7.000000 | 7.000000 | 630.000000 | 10.000000 | 4.00000 |
mean | NaN | NaN | 26.421769 | 2005.612245 | 4.605442 | 3.741497 | 342.789116 | 0.918367 | 0.47619 |
median | NaN | NaN | 26.000000 | 2007.000000 | 6.000000 | 5.000000 | 429.000000 | 0.000000 | 0.00000 |
Example: Similarly, passing a dictionary of functions keyed by column name applies the function only to the specified columns.
uswnt.agg({'Player': 'max', 'Pos': 'min', 'Age': 'mean', 'Ast': 'min'})
Player Wendy Gebauer
Pos 0
Age 26.4218
Ast 0
dtype: object
uswnt.agg({'Player': ['min', 'max'], 'Age': ['mean', np.median, 'max']})
 | Player | Age |
---|---|---|
max | Wendy Gebauer | 39.000000 |
mean | NaN | 26.421769 |
median | NaN | 26.000000 |
min | Abby Dahlkemper | NaN |