{ "cells": [ { "cell_type": "code", "execution_count": 11, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "\n", "pd.set_option('display.max_rows', 7)\n", "\n", "jobs = pd.read_csv('../01/data/san-diego-2017.csv', usecols=['Job Title'])\n", "idx = jobs.sample(frac=0.3).index\n", "jobs.loc[idx, 'Job Title'] = jobs.loc[idx, 'Job Title'].str.lower()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Data\n", "\n", "---\n", "\n", "Once data are collected and transformed into a tabular format, with observations and attributes, the individual entries are often raw text. Initially, these text fields contain informations that are not quantitatively usable. This chapter covers extraction of information from text, resulting in a table that amenable to study using the techniques from the first part of the book." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pattern Matching\n", "\n", "An effective, simple approach to extracting useful information from text is to find patterns that correlate with the concept being measured.\n", "\n", "**Example:** The table `jobs` below contains the job title of every San Diego city employee in 2017. In chapter 1, the investigation into the salaries finished with the question: \n", "> When controlling for 'job type', do women makes significantly less than their contemporaries?\n", "\n", "However, the 'Job Title' field in the dataset is messy. Many related jobs are described in different ways; most job titles are distinct in text, even if their are similar in reality. When should two jobs be considered of the same type?" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Job Title
0Fire Battalion Chief
1Fire Captain
2Chief Operating Officer
......
12490Council Rep 2 A
12491Sr Mgmt Anlyst
12492police officer
\n", "

12493 rows × 1 columns

\n", "
" ], "text/plain": [ " Job Title\n", "0 Fire Battalion Chief\n", "1 Fire Captain\n", "2 Chief Operating Officer\n", "... ...\n", "12490 Council Rep 2 A\n", "12491 Sr Mgmt Anlyst\n", "12492 police officer\n", "\n", "[12493 rows x 1 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jobs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The empirical distribution of 'Job Title' meaningfully differentiate between different jobs:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Police Officer 0.113183\n", "police officer 0.045065\n", "Rec Leader 1 0.016889\n", " ... \n", "Storm Water Environmental Specialist 0.000080\n", "construction estimator 0.000080\n", "Info Sys Admnstr (Bus Sys Admin) 0.000080\n", "Name: Job Title, Length: 1072, dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jobs['Job Title'].value_counts(normalize=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pattern matching: a general approach\n", "One approach to extracting information from text fields, is to 'measure the text' for useful information. For example, in the table of job titles:\n", "* How many employees do police work?\n", "* How many employees work in library-related work?\n", "* How many employees manage other people?\n", "\n", "Approaching question like these follows a simple procedure:\n", "1. Choose an initial pattern on which to match,\n", "2. Assess whether the pattern is too narrow or broad:\n", " - Examine at non-matching observations to assess if the pattern misses individuals,\n", " - Examine the matched observations to assess if the patter captures unintended individuals\n", "3. Make hypotheses and generalizations about what the text data look like, and test them.\n", "\n", "Notice that the correctness of such an intuitive pattern matching approach is not verifiable! The most one can hope for is *falsification*: an accrual of evidence that the pattern captures the indended concept." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example:** The question of how many employees do police work is one of the easier questions to ask of the `jobs` dataset. As seen from the empirical distribution of job titles, both 'Police Officer' and 'police officer' are common job titles. Ignoring the case of the words, a first reasonable pattern to consider is `police`.\n", "\n", "The Series method `contains` in the `str` namespace performs matching on regular-expressions, returning a boolean array:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Police Officer 1414\n", "police officer 563\n", "Police Dispatcher 46\n", " ... \n", "police code compl ofcr 1\n", "executive assistant police chief 1\n", "police records data spec supv 1\n", "Name: Job Title, Length: 36, dtype: int64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "contains_police = jobs['Job Title'].str.contains('police', case=False)\n", "jobs.loc[contains_police, 'Job Title'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This naive pattern returns approximately 2000 job titles relating to police. Did this pattern miss other police related jobs? The dataset contains abbreviations, so perhaps likely abbreviations of 'police' should also be tried:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Clerical Asst 2(Pol Clrk) 5\n", "clerical asst 2(pol clrk) 2\n", "clerical asst 1(pol clrk) 1\n", "Name: Job Title, dtype: int64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "contains_pol = jobs['Job Title'].str.contains('pol', case=False)\n", "jobs.loc[contains_pol & ~contains_police, 'Job Title'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Eight possible observations related to police work are returned. If 'pol clrk' stands for 'Police Clerk', this more general pattern is better to use; outside research is needed to answer this question.\n", "\n", "Are there job titles in the area of police work that might contain the word 'police'? A next step would be to propose new patterns that might capture police-related work. This may be approached via:\n", "* choosing police-related words that commonly appear in the job titles what contained the initial pattern, or\n", "* researching the domain and generating a list of keywords by hand.\n", "\n", "For example, 'Police Dispatcher' appears in the pattern matching above. Does 'Dispatch' appear more broadly?" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Dispatcher 2 57\n", "Police Dispatcher 46\n", "Dispatcher 1 29\n", " ..\n", "fire dispatch supv 1\n", "Fire Dispatch Administrator 1\n", "public works dispatch supv 1\n", "Name: Job Title, Length: 19, dtype: int64" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "contains_dispatch = jobs['Job Title'].str.contains('dispatch', case=False)\n", "jobs.loc[contains_dispatch, 'Job Title'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Other instances of 'Dispatcher' appear, including 'fire dispatch' and 'public works dispatch'. These observations raise the question of whether the generic 'Dispatcher' titles are police related or not. More domain research might be necessary.\n", "\n", "Another possibility might be to search for the term 'Crime' in the job titles, which would almost certainly be police-related:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Crime Scene Specialist 7\n", "crime scene specialist 2\n", "Crime Laboratory Manager 1\n", "Supv Crime Scene Specialist 1\n", "Name: Job Title, dtype: int64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "contains_crime = jobs['Job Title'].str.contains('Crime', case=False)\n", "jobs.loc[contains_crime, 'Job Title'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This additional pattern match results in new police-related job titles. These patterns can be combined using a regular-expression:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Police Officer 1414\n", "police officer 563\n", "Police Dispatcher 46\n", " ... \n", "sr police records clerk 1\n", "Conf Secretary to Police Chief 1\n", "police records data spec supv 1\n", "Name: Job Title, Length: 43, dtype: int64" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "police_jobs = jobs['Job Title'].str.contains('pol|crime', case=False)\n", "jobs.loc[police_jobs, 'Job Title'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, this pattern is likely not exhaustive; it is mere better than the initial pattern. This process must continue until the results are good enough to use." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Canonicalization \n", "\n", "In the job titles dataset, the individual job titles often represented the same job title in different ways. For example, the job of a police officer was represented both in lower-case ('police officer') and upper-case ('Police Officer'). Transforming these different representations into a single form helps simplify the difficult process of finding appropriate patterns. This process is called *canonicalization*.\n", "\n", "Datasets often have inconsistencies:\n", "* Some text might contain upper-case letters, while others are lower-case.\n", "* Some text may contain abbreviations. further, the abbreviations may not be consistent.\n", "* Punctuation may be used inconsistently.\n", "* Text may contains superfluous information.\n", "\n", "In each of these examples, these inconsistencies require developing more sophisticated patterns for extracting the needed information from the text. Taking care of these inconsistencies *before* attempting to extract information simplifies the process. \n", "\n", "Canonicalization of text content refers to a function that chooses a standard form in which to represent each value." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example:** Canonicalizing the job titles involves dealing with all of the inconsistencies listed above. In order of increasing difficulty, the canonicalizing job title will handing the following issues:\n", "\n", "1. handle mixed cases by transforming all characters to lower-case,\n", "1. handle inconsistent use of punctuation by removing punctuation,\n", "1. handle abbreviations by matching them to known words.\n", "\n", "\n", "**Inconsistent Case**. The method `lower` transforms the case of the characters in the 'Job Title' columns:" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 fire battalion chief\n", "1 fire captain\n", "2 chief operating officer\n", " ... \n", "12490 council rep 2 a\n", "12491 sr mgmt anlyst\n", "12492 police officer\n", "Name: Job Title, Length: 12493, dtype: object" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jobs['Job Title'].str.lower()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Punctuation and Non-Alphanumeric Characters**.\n", "\n", "Carefully replacing the punctuation involves understanding what punctuation is used and whether the existing punctuation is necessary information to keep.\n", "\n", "Selecting job titles that contain non-alphanumeric characters, shows a number of usages of special characters:\n", "* The `&` symbol represents the word `and` (and that meaning should not be lost),\n", "* The `/` symbol separates two words without a space (and so should be replaced with a space),\n", "* `-` separates words *with* spaces, and should be replaced without spaces." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Job Title
95Deputy City Atty - Unrep
154Park & Recreation Director
159Asst Fire Marshal/Civ
......
12430Rec Leader 2(Dance Instr)
12432Asst Mgmt Anlyst(Litrcy Tut/Lrng Coord)
12481Clerical Asst 2(Temp Pool)
\n", "

1145 rows × 1 columns

\n", "
" ], "text/plain": [ " Job Title\n", "95 Deputy City Atty - Unrep\n", "154 Park & Recreation Director\n", "159 Asst Fire Marshal/Civ\n", "... ...\n", "12430 Rec Leader 2(Dance Instr)\n", "12432 Asst Mgmt Anlyst(Litrcy Tut/Lrng Coord)\n", "12481 Clerical Asst 2(Temp Pool)\n", "\n", "[1145 rows x 1 columns]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jobs[jobs['Job Title'].str.contains('[^A-Za-z0-9\\s]')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As 1000 observations are too many to look at by hand, use the `extract` method to create a full list of non-alphanumeric characters:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "- 695\n", ") 295\n", "( 295\n", "/ 150\n", "& 83\n", "' 23\n", ", 4\n", "Name: 0, dtype: int64" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " jobs['Job Title']\n", " .str.extractall('([^A-Za-z0-9\\s])') # returns a multi-index for > 1 match\n", " .dropna()\n", " .reset_index(drop=True)\n", " .squeeze()\n", " .value_counts()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The totality of non-alphanumeric characters includes `- ( ) / & ' ,`, each of which should be handled differently." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 fire battalion chief\n", "1 fire captain\n", "2 chief operating officer\n", " ... \n", "12490 council rep 2 a\n", "12491 sr mgmt anlyst\n", "12492 police officer\n", "Name: Job Title, Length: 12493, dtype: object" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " jobs['Job Title']\n", " .str.lower() # lower case\n", " .str.replace('&', 'and') # replace '&' with 'and'\n", " .str.replace('[^A-Za-z0-9\\s]', ' ') # replace all other punctuation with space\n", " .str.replace('\\s+', ' ') # collapse multiple whitespace down to one.\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Abbreviations**.\n", "\n", "Lastly, job titles have different abbreviations that are used inconsistently across the dataset. For example, 'analyst' job titles may either be represented as either 'Analyst' or 'Anlyst':" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Job Title
14Independent Budget Anlyst
1032Budget/Legislative Analyst 1
1658Budget/Legislative Analyst 1
......
12432Asst Mgmt Anlyst(Litrcy Tut/Lrng Coord)
12454sr mgmt anlyst
12491Sr Mgmt Anlyst
\n", "

504 rows × 1 columns

\n", "
" ], "text/plain": [ " Job Title\n", "14 Independent Budget Anlyst\n", "1032 Budget/Legislative Analyst 1\n", "1658 Budget/Legislative Analyst 1\n", "... ...\n", "12432 Asst Mgmt Anlyst(Litrcy Tut/Lrng Coord)\n", "12454 sr mgmt anlyst\n", "12491 Sr Mgmt Anlyst\n", "\n", "[504 rows x 1 columns]" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jobs[jobs['Job Title'].str.lower().str.contains('analyst|anlyst')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Canonicalizing abbreviations is a harder task and the way with which it's dealt depends on what's being done with the data. A few things to keep in mind:\n", "* Is it import to understand *what* the abbreviation means, or just that it's consistently used throughout the dataset?\n", "* To find instances of abbreviations that might not be used consistently\n", " - use a dictionary to find non-words, \n", " - use edit-distance functions to find small variations between words that might have similar meaning." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }