{ "cells": [ { "cell_type": "code", "execution_count": 11, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "\n", "pd.set_option('display.max_rows', 7)\n", "\n", "jobs = pd.read_csv('../01/data/san-diego-2017.csv', usecols=['Job Title'])\n", "idx = jobs.sample(frac=0.3).index\n", "jobs.loc[idx, 'Job Title'] = jobs.loc[idx, 'Job Title'].str.lower()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Data\n", "\n", "---\n", "\n", "Once data are collected and transformed into a tabular format, with observations and attributes, the individual entries are often raw text. Initially, these text fields contain information that is not quantitatively usable. This chapter covers the extraction of information from text, resulting in a table that is amenable to study using the techniques from the first part of the book." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pattern Matching\n", "\n", "An effective, simple approach to extracting useful information from text is to find patterns that correlate with the concept being measured.\n", "\n", "**Example:** The table `jobs` below contains the job title of every San Diego city employee in 2017. In chapter 1, the investigation into the salaries finished with the question:\n", "> When controlling for 'job type', do women make significantly less than their contemporaries?\n", "\n", "However, the 'Job Title' field in the dataset is messy. Many related jobs are described in different ways; most job titles are distinct as text, even if they are similar in reality. When should two jobs be considered of the same type?" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | Job Title |\n",
"|---|---|\n",
"| 0 | Fire Battalion Chief |\n",
"| 1 | Fire Captain |\n",
"| 2 | Chief Operating Officer |\n",
"| ... | ... |\n",
"| 12490 | Council Rep 2 A |\n",
"| 12491 | Sr Mgmt Anlyst |\n",
"| 12492 | police officer |\n",
"\n",
"12493 rows × 1 columns
\n",
"\n",
"|  | Job Title |\n",
"|---|---|\n",
"| 95 | Deputy City Atty - Unrep |\n",
"| 154 | Park & Recreation Director |\n",
"| 159 | Asst Fire Marshal/Civ |\n",
"| ... | ... |\n",
"| 12430 | Rec Leader 2(Dance Instr) |\n",
"| 12432 | Asst Mgmt Anlyst(Litrcy Tut/Lrng Coord) |\n",
"| 12481 | Clerical Asst 2(Temp Pool) |\n",
"\n",
"1145 rows × 1 columns
\n",
"\n",
"|  | Job Title |\n",
"|---|---|\n",
"| 14 | Independent Budget Anlyst |\n",
"| 1032 | Budget/Legislative Analyst 1 |\n",
"| 1658 | Budget/Legislative Analyst 1 |\n",
"| ... | ... |\n",
"| 12432 | Asst Mgmt Anlyst(Litrcy Tut/Lrng Coord) |\n",
"| 12454 | sr mgmt anlyst |\n",
"| 12491 | Sr Mgmt Anlyst |\n",
"\n",
"504 rows × 1 columns
\n", "