{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove_cell"
]
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"\n",
"pd.set_option('display.max_rows', 7)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Parsing HTML\n",
"---\n",
"\n",
"The previous section described the process of collecting data over a network via HTTP requests. In particular, when scraping websites, these requests result in the collection of raw source data in the form of HTML.\n",
"\n",
"HyperText Markup Language, or HTML, defines the structure of web content rendered on in a web browser. Thus, if a dataset requires extracting information from a website, the content must be found in, and retrieved from, the HTML.\n",
"\n",
"Understanding how HTML visually represents a website helps write more robust data-extraction code. While HTML can be treated solely as text, using the structure helps the developer write code that easily adapts to both changing requirements in the data collection, as well as evolving website source."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Anatomy of HTML\n",
"\n",
"A website, represented in HTML, is described using the framework of the Document Object Model:\n",
"\n",
"* The *HTML Document* is the totality of the markup that makes up a website.\n",
"* The *Document Object Model* (DOM) is the internal representation of an HTML document as a *tree* structure.\n",
"* An HTML *Element* is a subtree of the document. Visually, elements are regions of the webpage.\n",
"* HTML *Tags* are markers that denote the start and end of an element.\n",
"\n",
"**Example:** The basic website below, is represented as: the document rendered by the browser, the HTML source code, and the DOM tree."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The root of document tree is the `` element, which contains all the HTML source. The root typically contains two children: the head, containing metadata for the page, and the body, which contains the information rendered on the page itself. The body of this page consists of three portions: the header and two numbered sections, each of which includes a section header and text. Notice that all of these portions consist of subtrees themselves."
]
},
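{
"cell_type": "markdown",
"metadata": {},
"source": [
"The tree structure described above can be sketched with Python's built-in `html.parser` module. The snippet below is only a rough illustration: `SAMPLE_HTML` is a made-up stand-in for a page with a header and two sections (not the actual source of the example page), and `TreePrinter` is a hypothetical helper name. It records each opening tag indented by its depth, which approximates the shape of the DOM tree.\n",
"\n",
"```python\n",
"from html.parser import HTMLParser\n",
"\n",
"# Hypothetical page: a header and two sections, each with a heading and text.\n",
"SAMPLE_HTML = '''\n",
"<html>\n",
"  <head><title>Page title</title></head>\n",
"  <body>\n",
"    <h1>Header</h1>\n",
"    <div><h2>Section 1</h2><p>Text one.</p></div>\n",
"    <div><h2>Section 2</h2><p>Text two.</p></div>\n",
"  </body>\n",
"</html>\n",
"'''\n",
"\n",
"\n",
"class TreePrinter(HTMLParser):\n",
"    # Records every opening tag, indented by its depth in the tree.\n",
"\n",
"    def __init__(self):\n",
"        super().__init__()\n",
"        self.depth = 0\n",
"        self.lines = []\n",
"\n",
"    def handle_starttag(self, tag, attrs):\n",
"        # Entering an element: record it, then descend one level.\n",
"        self.lines.append('  ' * self.depth + tag)\n",
"        self.depth += 1\n",
"\n",
"    def handle_endtag(self, tag):\n",
"        # Leaving an element: climb back up one level.\n",
"        self.depth -= 1\n",
"\n",
"\n",
"printer = TreePrinter()\n",
"printer.feed(SAMPLE_HTML)\n",
"for line in printer.lines:\n",
"    print(line)\n",
"```\n",
"\n",
"This sketch assumes every opening tag has a matching closing tag; real pages often contain void elements such as `<br>` or `<img>`, which would need separate handling."
]
},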
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Common tags\n",
"\n",
"Tags define the visual appearance of a particular element. They typically fall into two different types:\n",
"1. tags defining structural elements (regions of the page), and\n",
"1. tags defining stylistic elements (e.g. formatting).\n",
"\n",
"The table below summarizes the most useful tags:\n",
"\n",
"|Structure Elements|Description|Head/Body Elements|Description|\n",
"|---|---|---|---|\n",
"|``|the document|`
`|the paragraph|\n", "|`
`|the header|`