Text Data

Once data are collected and transformed into a tabular format, with observations and attributes, the individual entries are often raw text. Initially, these text fields contain informations that are not quantitatively usable. This chapter covers extraction of information from text, resulting in a table that amenable to study using the techniques from the first part of the book.

Pattern Matching

An effective, simple approach to extracting useful information from text is to find patterns that correlate with the concept being measured.

Example: The table jobs below contains the job title of every San Diego city employee in 2017. In chapter 1, the investigation into the salaries finished with the question:

When controlling for ‘job type’, do women makes significantly less than their contemporaries?

However, the ‘Job Title’ field in the dataset is messy. Many related jobs are described in different ways; most job titles are distinct in text, even if their are similar in reality. When should two jobs be considered of the same type?

Job Title
0 Fire Battalion Chief
1 Fire Captain
2 Chief Operating Officer
... ...
12490 Council Rep 2 A
12491 Sr Mgmt Anlyst
12492 police officer

12493 rows × 1 columns

The empirical distribution of ‘Job Title’ meaningfully differentiate between different jobs:

jobs['Job Title'].value_counts(normalize=True)
Police Officer                          0.113183
police officer                          0.045065
Rec Leader 1                            0.016889
Storm Water Environmental Specialist    0.000080
construction estimator                  0.000080
Info Sys Admnstr (Bus Sys Admin)        0.000080
Name: Job Title, Length: 1072, dtype: float64

Pattern matching: a general approach

One approach to extracting information from text fields, is to ‘measure the text’ for useful information. For example, in the table of job titles:

  • How many employees do police work?

  • How many employees work in library-related work?

  • How many employees manage other people?

Approaching question like these follows a simple procedure:

  1. Choose an initial pattern on which to match,

  2. Assess whether the pattern is too narrow or broad:

    • Examine at non-matching observations to assess if the pattern misses individuals,

    • Examine the matched observations to assess if the patter captures unintended individuals

  3. Make hypotheses and generalizations about what the text data look like, and test them.

Notice that the correctness of such an intuitive pattern matching approach is not verifiable! The most one can hope for is falsification: an accrual of evidence that the pattern captures the indended concept.

Example: The question of how many employees do police work is one of the easier questions to ask of the jobs dataset. As seen from the empirical distribution of job titles, both ‘Police Officer’ and ‘police officer’ are common job titles. Ignoring the case of the words, a first reasonable pattern to consider is police.

The Series method contains in the str namespace performs matching on regular-expressions, returning a boolean array:

contains_police = jobs['Job Title'].str.contains('police', case=False)
jobs.loc[contains_police, 'Job Title'].value_counts()
Police Officer                      1414
police officer                       563
Police Dispatcher                     46
police code compl ofcr                 1
executive assistant police chief       1
police records data spec supv          1
Name: Job Title, Length: 36, dtype: int64

This naive pattern returns approximately 2000 job titles relating to police. Did this pattern miss other police related jobs? The dataset contains abbreviations, so perhaps likely abbreviations of ‘police’ should also be tried:

contains_pol = jobs['Job Title'].str.contains('pol', case=False)
jobs.loc[contains_pol & ~contains_police, 'Job Title'].value_counts()
Clerical Asst 2(Pol Clrk)    5
clerical asst 2(pol clrk)    2
clerical asst 1(pol clrk)    1
Name: Job Title, dtype: int64

Eight possible observations related to police work are returned. If ‘pol clrk’ stands for ‘Police Clerk’, this more general pattern is better to use; outside research is needed to answer this question.

Are there job titles in the area of police work that might contain the word ‘police’? A next step would be to propose new patterns that might capture police-related work. This may be approached via:

  • choosing police-related words that commonly appear in the job titles what contained the initial pattern, or

  • researching the domain and generating a list of keywords by hand.

For example, ‘Police Dispatcher’ appears in the pattern matching above. Does ‘Dispatch’ appear more broadly?

contains_dispatch = jobs['Job Title'].str.contains('dispatch', case=False)
jobs.loc[contains_dispatch, 'Job Title'].value_counts()
Dispatcher 2                   57
Police Dispatcher              46
Dispatcher 1                   29
fire dispatch supv              1
Fire Dispatch Administrator     1
public works dispatch supv      1
Name: Job Title, Length: 19, dtype: int64

Other instances of ‘Dispatcher’ appear, including ‘fire dispatch’ and ‘public works dispatch’. These observations raise the question of whether the generic ‘Dispatcher’ titles are police related or not. More domain research might be necessary.

Another possibility might be to search for the term ‘Crime’ in the job titles, which would almost certainly be police-related:

contains_crime = jobs['Job Title'].str.contains('Crime', case=False)
jobs.loc[contains_crime, 'Job Title'].value_counts()
Crime Scene Specialist         7
crime scene specialist         2
Crime Laboratory Manager       1
Supv Crime Scene Specialist    1
Name: Job Title, dtype: int64

This additional pattern match results in new police-related job titles. These patterns can be combined using a regular-expression:

police_jobs = jobs['Job Title'].str.contains('pol|crime', case=False)
jobs.loc[police_jobs, 'Job Title'].value_counts()
Police Officer                    1414
police officer                     563
Police Dispatcher                   46
sr police records clerk              1
Conf Secretary to Police Chief       1
police records data spec supv        1
Name: Job Title, Length: 43, dtype: int64

Of course, this pattern is likely not exhaustive; it is mere better than the initial pattern. This process must continue until the results are good enough to use.


In the job titles dataset, the individual job titles often represented the same job title in different ways. For example, the job of a police officer was represented both in lower-case (‘police officer’) and upper-case (‘Police Officer’). Transforming these different representations into a single form helps simplify the difficult process of finding appropriate patterns. This process is called canonicalization.

Datasets often have inconsistencies:

  • Some text might contain upper-case letters, while others are lower-case.

  • Some text may contain abbreviations. further, the abbreviations may not be consistent.

  • Punctuation may be used inconsistently.

  • Text may contains superfluous information.

In each of these examples, these inconsistencies require developing more sophisticated patterns for extracting the needed information from the text. Taking care of these inconsistencies before attempting to extract information simplifies the process.

Canonicalization of text content refers to a function that chooses a standard form in which to represent each value.

Example: Canonicalizing the job titles involves dealing with all of the inconsistencies listed above. In order of increasing difficulty, the canonicalizing job title will handing the following issues:

  1. handle mixed cases by transforming all characters to lower-case,

  2. handle inconsistent use of punctuation by removing punctuation,

  3. handle abbreviations by matching them to known words.

Inconsistent Case. The method lower transforms the case of the characters in the ‘Job Title’ columns:

jobs['Job Title'].str.lower()
0           fire battalion chief
1                   fire captain
2        chief operating officer
12490            council rep 2 a
12491             sr mgmt anlyst
12492             police officer
Name: Job Title, Length: 12493, dtype: object

Punctuation and Non-Alphanumeric Characters.

Carefully replacing the punctuation involves understanding what punctuation is used and whether the existing punctuation is necessary information to keep.

Selecting job titles that contain non-alphanumeric characters, shows a number of usages of special characters:

  • The & symbol represents the word and (and that meaning should not be lost),

  • The / symbol separates two words without a space (and so should be replaced with a space),

  • - separates words with spaces, and should be replaced without spaces.

jobs[jobs['Job Title'].str.contains('[^A-Za-z0-9\s]')]
Job Title
95 Deputy City Atty - Unrep
154 Park & Recreation Director
159 Asst Fire Marshal/Civ
... ...
12430 Rec Leader 2(Dance Instr)
12432 Asst Mgmt Anlyst(Litrcy Tut/Lrng Coord)
12481 Clerical Asst 2(Temp Pool)

1145 rows × 1 columns

As 1000 observations are too many to look at by hand, use the extract method to create a full list of non-alphanumeric characters:

    jobs['Job Title']
    .str.extractall('([^A-Za-z0-9\s])') # returns a multi-index for > 1 match
-    695
)    295
(    295
/    150
&     83
'     23
,      4
Name: 0, dtype: int64

The totality of non-alphanumeric characters includes - ( ) / & ' ,, each of which should be handled differently.

    jobs['Job Title']
    .str.lower()  # lower case
    .str.replace('&', 'and') # replace '&' with 'and'
    .str.replace('[^A-Za-z0-9\s]', ' ') # replace all other punctuation with space
    .str.replace('\s+', ' ') # collapse multiple whitespace down to one.
0           fire battalion chief
1                   fire captain
2        chief operating officer
12490            council rep 2 a
12491             sr mgmt anlyst
12492             police officer
Name: Job Title, Length: 12493, dtype: object


Lastly, job titles have different abbreviations that are used inconsistently across the dataset. For example, ‘analyst’ job titles may either be represented as either ‘Analyst’ or ‘Anlyst’:

jobs[jobs['Job Title'].str.lower().str.contains('analyst|anlyst')]
Job Title
14 Independent Budget Anlyst
1032 Budget/Legislative Analyst 1
1658 Budget/Legislative Analyst 1
... ...
12432 Asst Mgmt Anlyst(Litrcy Tut/Lrng Coord)
12454 sr mgmt anlyst
12491 Sr Mgmt Anlyst

504 rows × 1 columns

Canonicalizing abbreviations is a harder task and the way with which it’s dealt depends on what’s being done with the data. A few things to keep in mind:

  • Is it import to understand what the abbreviation means, or just that it’s consistently used throughout the dataset?

  • To find instances of abbreviations that might not be used consistently

    • use a dictionary to find non-words,

    • use edit-distance functions to find small variations between words that might have similar meaning.