An Example Investigation¶

This section goes through a data-driven investigation that illustrates common techniques in the Data Scientist’s toolkit: assessing the dataset, formulating a question, joining new data to improve an answer, assessing the applicability of a potential answer beyond the dataset, and due diligence on the validity of each step in the reasoning.

The code in this section is hidden to keep the focus on the data and reasoning. The techniques (and code) used in this section are covered in Part I of this book.

San Diego City Salaries¶

The dataset at hand includes a list of all San Diego city employee salaries for the year of 2017. This includes employee names and job titles, as well as the components of their total pay during that year.

SD Employee Salaries: background research¶

Why try to understand this dataset?

Such a profile could inform 3rd party workplace programs.
Journalists might search for salary anomalies.
Auditors may want actionable advice on fair employment practices.
Are the earning of these employees fair?

Before formulating goals and questions, an initial look is useful understand the data by building a profile of San Diego city employees.

An initial look: what to keep in mind¶

The data, along with a brief description, is available from Transparent California salary website.

Some questions to keep in mind while looking at the dataset:

What columns are required for potential questions that might be asked?
Which columns have the relevant information?
Does that information likely reflect real-life salaries or are there mistakes in the data?
Are the data in the table self-consistent?

salary_path = os.path.join('data', 'san-diego-2017.csv')
salaries = pd.read_csv(salary_path)
salaries.reset_index(drop=True)

	Employee Name	Job Title	Base Pay	Overtime Pay	Other Pay	Benefits	Total Pay	Total Pay & Benefits	Year	Notes	Agency	Status
0	David Xxxx	Fire Battalion Chief	81917.0	172590.0	68870.00	5194.0	323377.0	328571.0	2017	NaN	San Diego	FT
1	Glen Xxxx	Fire Captain	85904.0	120682.0	99408.00	10107.0	305994.0	316101.0	2017	NaN	San Diego	FT
2	Scott Xxxx	Chief Operating Officer	255000.0	0.0	31164.00	22688.0	286164.0	308852.0	2017	NaN	San Diego	FT
...	...	...	...	...	...	...	...	...	...	...	...	...
12490	Stephen Xxxx	Council Rep 2 A	0.0	0.0	8.00	0.0	8.0	8.0	2017	NaN	San Diego	PT
12491	Tania Xxxx	Sr Mgmt Anlyst	0.0	0.0	8.00	0.0	8.0	8.0	2017	NaN	San Diego	PT
12492	Brian Xxxx	Police Officer	0.0	0.0	3.00	0.0	3.0	3.0	2017	NaN	San Diego	PT

12493 rows × 12 columns

Aside on privacy and ethics¶

This dataset contains the full names of city employees; these names correspond to real people. Such information is called PII, which stands for personally identifiable information. In many cases it is illegal to improperly disclose PII. In this case public employees are public figures and it’s not against the law to disclose their names.

However, there is a difference between what is legal and what is ethical. In this case, the difference between ‘public record’ and ‘searchable record’ may have very different impact on someone’s life. Public record exists to give people who want to know a small amount of relevant information the ability to do so. Collecting all public records for a single person in one place creates an intimate picture of that person’s life that may easily cross into invasion of privacy.

The takeaway: there are people behind each data point, and those people’s data should not be needlessly propagated! In this case, the last names of the employees have been blanked out, even though it’s not legally necessary to do so.

Basic description of employee pay¶

The table below contains a basic of description of employee pay. What does typical pay look like? Are there questions about data reliability?

salaries.describe().T

	count	mean	std	min	25%	50%	75%	max
Base Pay	12493.0	48843.853918	29377.449188	0.0	28888.0	49254.0	68952.0	255000.0
Overtime Pay	12493.0	6573.031858	15308.455700	-623.0	0.0	393.0	5408.0	196978.0
Benefits	12493.0	8667.754983	6456.089561	-29.0	2785.0	8420.0	13485.0	36030.0
Total Pay	12493.0	64984.971024	41812.990357	3.0	36430.0	61452.0	91400.0	323377.0
Total Pay & Benefits	12493.0	73652.726007	45761.462637	3.0	44788.0	71264.0	103131.0	328571.0
Year	12493.0	2017.000000	0.000000	2017.0	2017.0	2017.0	2017.0	2017.0
Notes	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN

A few of the questions raised by this table of statistics are:

What are the negative payments? Near zero salaries?
Where did the “Other Pay” column go?
Are the salaries in the ‘max’ column real?
This dataset only contains one year!

These descriptive statistics hide a lot of information about the data. However, that over-simplification makes the dataset more understandable. These questions are a starting point for focusing a detailed investigation into the dataset and how well it represents the reality being analyzed.

Empirical Distribution of Salaries¶

As employee total pay is the primary attribute of focus, it makes sense to understand how it varies across the entire population of employees. Plotting the empirical distribution of salaries raises two observations:

The distribution is ‘bimodal’ and is likely comprised of two distributions.
The salaries have a skew to the right, which is typical for a quantity that can only be non-negative.

salaries['Total Pay'].plot(kind='hist', bins=50, density=True, title='distribution of Total Salaries');

../../_images/data-science-example_10_0.png

A reasonable guess for the bimodal nature of the distribution of salaries is the employment status. One would expect salaries to vary significantly based on whether an employee works Part-time versus Full-time. Splitting the population up by job status reveals two distributions:

The part-time jobs tend to have lower salaries, closer to 0,
The full-time jobs tends to have salaries centered around 80,000 USD.

bystatus = salaries.groupby('Status')
bystatus['Total Pay'].plot(kind='kde', alpha=0.5, title='Salary by Full-time/Part-time')
plt.legend(bystatus.groups);

../../_images/data-science-example_12_0.png

salaries = salaries[['Employee Name', 'Job Title', 'Total Pay', 'Status']].copy()

Do women earn similar pay to their contemporaries?¶

With a general assessment of the data finished, a more pointed questions about the data can be pursued. In this case: do employees of different genders have similar pay?

However, this dataset doesn’t contain information on the gender of employees. The dataset does have the first names of employees, which contains imperfect information about gender. A reasonable approach is to find a dataset that contains information about correspondences between names and gender.

In this case, the Social Security Administration publishes a “baby names” dataset that does exactly this.

SSA names dataset¶

The Social Security Administration compiles a list of all names on social security applications in a given year, whether the applicant identified as Male or Female. This list can then be used to label the most likely gender of the employees using their first names.

from glob import glob
import os

names_path = os.path.join('data', 'names.csv')
names = pd.read_csv(names_path)
names.head()

	firstname	gender	count	year
0	Emily	F	25956	2000
1	Hannah	F	23082	2000
2	Madison	F	19968	2000
3	Ashley	F	17997	2000
4	Sarah	F	17702	2000

Basic check of `names`:¶

There are a number of details to attend to in SSA dataset:

Many names identify to both genders (gender-neutral names).
Most names occur only a few times per year (most names are rare).
A few names make up most the applications.

Notice, the name “Madison” is mostly identified as female, though there are consistently a few males with that name as well:

# look at a single name
names[names['firstname'] == 'Madison'].sort_values(by='year', ascending=False)

	firstname	gender	count	year
1887827	Madison	M	36	2018
1866629	Madison	F	7036	2018
161087	Madison	F	7875	2017
...	...	...	...	...
1932167	Madison	M	27	1882
1843222	Madison	M	28	1881
1841285	Madison	M	22	1880

178 rows × 4 columns

Approach to joining gender:¶

Create a table of distinct names with the proportion of applications on which that name identifies as female.
That is, for each name \(N\), compute:

\[P({\rm person\ is\ female\ }|{\rm \ person\ has\ first\ name\ } N)\]

Join this table to the salaries dataset.

First, creating the table of counts of gender for each name over every year:

# Counts by gender
cnts_by_gender = names.pivot_table(
    index='firstname', 
    columns=['gender'], 
    values='count', 
    aggfunc='sum', 
    fill_value=0
)

names_idx = ['Aaron', 'Maria', 'Dakota', 'Ashley', 'Avery', 'Paris']
cnts_by_gender.loc[names_idx, :]

gender	F	M
firstname
Aaron	4307	581330
Maria	546026	4237
Dakota	33204	86089
Ashley	846120	15668
Avery	125883	55646
Paris	28841	8812

From the total counts in the above table, calculate the proportion of a given name that’s identified as female. If this number is greater than 0.5, then the name is likely associated to female; otherwise the name mostly associates to male.

# proportion of a given name that's identified female
prop_female = (cnts_by_gender['F'] / cnts_by_gender.sum(axis=1))
genders = (
    prop_female.rename('proportion of a given name that\'s identified as female').to_frame()
    .assign(**{'gender': 
               prop_female.apply(lambda x:'F' if x > 0.5 else 'M').rename('prop_female').to_frame()
              }))

genders.loc[names_idx]

Add a given name column to `salaries` and join names¶

This table of names and their most likely gender attaches a ‘most likely gender’ to the employees in the salaries dataset. This identification is approximate and doesn’t reflect the actual gender with which the employees identify.

# Add firstname column
salaries.loc[:, 'firstname'] = salaries['Employee Name'].str.split().apply(lambda x:x[0])

# join gender
salaries_with_gender = salaries.merge(genders.reset_index(), on='firstname', how='left')
salaries_with_gender.sample(5).reset_index(drop=True)

	Employee Name	Job Title	Total Pay	Status	firstname	proportion of a given name that's identified as female	gender
0	Leonardo Xxxx	Library Clerk	43336.0	FT	Leonardo	0.001915	M
1	Reina Xxxx	Accountant 3	77461.0	FT	Reina	1.000000	F
2	La Xxxx	Info Sys Anlyst 3	72983.0	FT	La	0.824943	F
3	Montserrat Xxxx	Pool Guard 1	8476.0	PT	Montserrat	1.000000	F
4	Serena Xxxx	Program Manager	122884.0	FT	Serena	0.999616	F

Do women earn similar pay to their contemporaries?¶

With a most likely gender attached to the salaries dataset, the salaries can now be described by gender:

pd.concat([
    salaries_with_gender.groupby('gender')['Total Pay'].describe().T,
    salaries_with_gender['Total Pay'].describe().rename('All')
], axis=1)

	F	M	All
count	4165.000000	7907.000000	12493.000000
mean	53970.997119	71226.907550	64984.971024
std	36251.189372	43419.913446	41812.990357
min	8.000000	3.000000	3.000000
25%	25853.000000	41557.500000	36430.000000
50%	50789.000000	67874.000000	61452.000000
75%	75128.000000	100689.000000	91400.000000
max	237512.000000	323377.000000	323377.000000

There is clearly a large difference in salaries between the males and females! This observation brings up a variety of questions:

Is this difference the result of some sort of true unfairness, or perhaps the difference is just due to chance?
If the difference isn’t due to chance, why does it exist?

Use a hypothesis test¶

Another way of phrasing the question “Is this difference the result of some sort of true unfairness, or perhaps the difference is just due to chance?” is as

Can women’s median pay be explained as a random subset of the population of city of SD salaries?

If so, the salary of women doesn’t significantly differ from the population; otherwise, some other mechanism is needed to explain the difference!

Hypothesis tests answer these sorts of questions. In particular, such a question can be simulated:

Random subsets of employees are drawn from the dataset, of the same size as the number of female employees,
The median salary of each of these random groups is calculated,
The observed salary of female employees is compared to the simulated ‘randomly drawn’ median salaries. Finally, one asks if the observed, real-life salary was just as likely drawn from a random subset of employees. If so, then the observed difference may have occurred due to chance; otherwise, something else is going on!

The plot below illustrates the results of this simulation:

The blue distribution represents the median salaries of these ‘randomly formed groups’.
The orange dot represents the real-life median female salary.

It seems unlikely this difference is due to chance!

# size of sample is number of women:
n_female = (salaries_with_gender['gender'] == 'F').sum()


# calculate observed 
female_median = salaries_with_gender.loc[salaries_with_gender['gender'] == 'F']['Total Pay'].median()


# simulate 1000 draws from the population of size n_female
medians = []
for _ in np.arange(1000):
    median = salaries_with_gender.sample(n_female)['Total Pay'].median()
    medians.append(median)
    
    
title='Median salary of randomly chosen groups from population'
pd.Series(medians).plot(kind='hist', title=title);
plt.plot([female_median], [0], marker='o', markersize=10)
plt.legend(['Observed Median Salary of Women', 'Median Salaries of Random Groups']);

Truism of data science¶

Now that the question of differences in the salaries of genders is answered, a truism of data science rears its head: answering one question always raises a dozen more.

First, are the results correct?

Is dataset error-free and representative (or error-free enough)?
Is the name-to-gender assignment correct (enough)?
What biases might have been introduced when joining the dataset of names to salaries?
Are the results applicable outside of 2017? outside of San Diego?

Second, why are the results what they are?

Is the disparity correlated to pay-type? job status? job type?
What is the cause of the disparity?

The sections below approach each of these questions, giving a feel for what’s involved in answering them.

Is the dataset representative?¶

Look up the “transparent California” and verify this dataset is a census (everyone).
Is “Total Pay” the most appropriate field to use?
Cross-reference independent counts of city of San Diego employees to assess the salary data.
What is the population of interest? Is it San Diego employees in 2017, or something more broad?

Is the name-to-gender assignment correct?¶

How many names are borderline male/female?
Does it make sense to incorporate name usage from all years in the dataset? (1880-2017?)

The plot below shows the distribution of ‘proportions of names being female.’

The bar near 0 are counts of names that are almost entirely male.
The bar near 1 are counts of names that are almost entirely female.
There are very few names in the middle that are gender-neutral.

However, each unit plotted is a distinct name; the proportions hide the number of people with each name. What if the most popular name in the country is gender-neutral? This distribution doesn’t reveal this property and is an example of what’s called an Ecological Fallacy (see the chapter on grouping and aggregation).

What’s more appropriate is to look at this distribution of confidence among the dataset of employees.

title = 'distribution of gendered-ness of names\n 0 = masculine \n 1 = feminine'
prop_female.plot(kind='hist', bins=20, title=title);

../../_images/data-science-example_33_0.png

Assessment of the join?¶

Are there names in the salary dataset that aren’t in the SSA dataset?
- Who might not be in the SSA dataset?
- Might these names be biased toward certain salaries?
Does the salary dataset have a disproportionately large portion of gender-neutral names.
Is it better to use a subset of the SSA dataset (e.g. by state? by year?)
- Do the gender of names typically vary by geography or over time?

The proportion of employees not in the SSA data is 3.5%, which is fairly small, but may affect the results. These individuals should be investigated more closely; a look at the employees with a gender assigned versus those that didn’t appear in the SSA names dataset reveals some bias (see table below).

Perhaps those that didn’t appear in the names dataset have lower salaries because they belong to an uncommon ethnic group (e.g. an immigrant group)? Such populations would likely work in jobs that earn lower salaries. One could further clean up these missing genders by incorporating the demographic information from immigration data.

These problems centered around bias from missing data are covered in the ‘Missing Data’ chapter of the book.

# proportion of employees not in SSA dataset
# salaries_with_gender['gender'].isnull().mean()

# Description of total pay by joined vs not joined
(
    salaries_with_gender
    .assign(joined=salaries_with_gender['gender'].notnull())
    .groupby('joined')['Total Pay']
    .describe()
    .T
)

joined	False	True
count	421.000000	12072.000000
mean	56714.684086	65273.389745
std	38445.038529	41897.581538
...	...	...
50%	54966.000000	61653.000000
75%	78376.000000	91895.000000
max	194920.000000	323377.000000

8 rows × 2 columns

nonjoins = salaries_with_gender.loc[salaries_with_gender['gender'].isnull()]

title = 'Distribution of Salaries'
nonjoins['Total Pay'].plot(kind='hist', bins=50, alpha=0.5, density=True, sharex=True)
salaries_with_gender['Total Pay'].plot(kind='hist', bins=50, alpha=0.5, density=True, sharex=True, title=title)
plt.legend(['Not in SSA','All']);

../../_images/data-science-example_36_0.png

Why does pay disparity exist?¶

Is the pay disparity correlated to another field? job status? job type? Is the proportion of women in a job type correlated to pay? One approach might ask if women earn similar salaries as men for a given job type.

Below, a few job types are isolated for investigation. For example, those who work in ‘Fire’ related fields tend to be male and make high salaries:

# select jobs with word 'fire' in them
firejobs = salaries_with_gender.loc[salaries_with_gender['Job Title'].str.contains('Fire')]
firejobs.sample(5)

	Employee Name	Job Title	Total Pay	Status	firstname	proportion of a given name that's identified as female	gender
342	Michael Xxxx	Fire Engineer	150291.0	PT	Michael	0.004957	M
555	Dustin Xxxx	Fire Fighter 2	134974.0	FT	Dustin	0.006679	M
12027	Chase Xxxx	Fire Fighter 2	3900.0	PT	Chase	0.021086	M
5718	Nick Xxxx	Fire Dispatch Supv	62799.0	FT	Nick	0.002659	M
2627	Dave Xxxx	Fire Engineer	91446.0	FT	Dave	0.001780	M

The proportion of fire-related jobs held by women is only 8.8%, yet fire-related jobs make significantly more than the overall median pay of 61,000USD per year:

# Proportion of fire-related jobs held by women
#(firejobs['gender'] == 'F').mean()

# Pay Statistics for fire-related jobs
firejobs['Total Pay'].describe()

count      1017.000000
mean     110967.646018
std       49678.113970
             ...      
50%      112568.000000
75%      141502.000000
max      323377.000000
Name: Total Pay, Length: 8, dtype: float64

On the other hand, those with library-related jobs tend to be female and make lower-than-average salaries:

# select jobs with library related jobs
libjobs = salaries_with_gender.loc[salaries_with_gender['Job Title'].str.contains('Librar')]
libjobs.sample(5)

	Employee Name	Job Title	Total Pay	Status	firstname	proportion of a given name that's identified as female	gender
5955	Patricia Xxxx	Librarian 2	63320.0	FT	Patricia	0.996853	F
10363	Joseph Xxxx	Library Aide	17777.0	PT	Joseph	0.004060	M
11194	Jaqueline Xxxx	Library Aide	10189.0	PT	Jaqueline	0.998469	F
8706	Alexis Xxxx	Library Clerk	39696.0	FT	Alexis	0.841756	F
8430	Steven Xxxx	Library Clerk	43056.0	FT	Steven	0.003469	M

The proportion of library-related jobs held by women is 64%, yet library related jobs make significantly less than the overall median salary:

# Proportion of library-related jobs held by women
#(libjobs['gender'] == 'F').mean()

# Pay Statistics for fire-related jobs
libjobs['Total Pay'].describe()

count       651.000000
mean      30383.377880
std       23236.318495
min          23.000000
25%       10598.500000
50%       26952.000000
75%       43566.000000
max      167269.000000
Name: Total Pay, dtype: float64

What is the cause of the disparity?¶

Now that the picture of employee salaries is better understood the next steps all consist of domain more research:

Research historical gender preferences across jobs
list possibilities and formulate hypothesis for a cause,
find data capable of describing these possibilities,
use a natural experiment to argue causality.

Data Science in Practice

An Example Investigation

Contents

An Example Investigation¶

San Diego City Salaries¶

SD Employee Salaries: background research¶

An initial look: what to keep in mind¶

Aside on privacy and ethics¶

Basic description of employee pay¶

Empirical Distribution of Salaries¶

Do women earn similar pay to their contemporaries?¶

SSA names dataset¶

Basic check of `names`:¶

Approach to joining gender:¶

Add a given name column to `salaries` and join names¶

Do women earn similar pay to their contemporaries?¶

Use a hypothesis test¶

Truism of data science¶

Is the dataset representative?¶

Is the name-to-gender assignment correct?¶

Assessment of the join?¶

Why does pay disparity exist?¶

What is the cause of the disparity?¶

Data Science in Practice

An Example Investigation

Contents

An Example Investigation¶

San Diego City Salaries¶

SD Employee Salaries: background research¶

An initial look: what to keep in mind¶

Aside on privacy and ethics¶

Basic description of employee pay¶

Empirical Distribution of Salaries¶

Do women earn similar pay to their contemporaries?¶

SSA names dataset¶

Basic check of names:¶

Approach to joining gender:¶

Add a given name column to salaries and join names¶

Do women earn similar pay to their contemporaries?¶

Use a hypothesis test¶

Truism of data science¶

Is the dataset representative?¶

Is the name-to-gender assignment correct?¶

Assessment of the join?¶

Why does pay disparity exist?¶

What is the cause of the disparity?¶

Basic check of `names`:¶

Add a given name column to `salaries` and join names¶