{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "\n", "pd.set_option('display.max_rows', 7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Parsing HTML\n", "---\n", "\n", "The previous section described the process of collecting data over a network via HTTP requests. In particular, when scraping websites, these requests result in the collection of raw source data in the form of HTML.\n", "\n", "HyperText Markup Language, or HTML, defines the structure of web content rendered in a web browser. Thus, if a dataset requires extracting information from a website, the content must be found in, and retrieved from, the HTML.\n", "\n", "Understanding how HTML structures a web page helps in writing more robust data-extraction code. While HTML can be treated solely as text, using its structure helps the developer write code that adapts easily both to changing data-collection requirements and to an evolving website source." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Anatomy of HTML\n", "\n", "A website, represented in HTML, is described using the framework of the Document Object Model:\n", "\n", "* The *HTML Document* is the totality of the markup that makes up a website.\n", "* The *Document Object Model* (DOM) is the internal representation of an HTML document as a *tree* structure.\n", "* An HTML *Element* is a subtree of the document. Visually, elements are regions of the webpage.\n", "* HTML *Tags* are markers that denote the start and end of an element.\n", "\n", "**Example:** The basic website below is shown in three forms: the document as rendered by the browser, the HTML source code, and the DOM tree." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The root of the document tree is the `<html>` element, which contains all the HTML source. The root typically contains two children: the head, containing metadata for the page, and the body, which contains the information rendered on the page itself. The body of this page consists of three portions: the header and two numbered sections, each of which includes a section header and text. Notice that each of these portions is itself a subtree." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Common tags\n", "\n", "Tags define the visual appearance of a particular element. They typically fall into two types:\n", "1. tags defining structural elements (regions of the page), and\n", "1. tags defining stylistic elements (e.g. formatting).\n", "\n", "The table below summarizes the most useful tags:\n", "\n", "|Structure Elements|Description|Head/Body Elements|Description|\n", "|---|---|---|---|\n", "|`<html>`|the document|`<p>`|the paragraph|\n", "|`<head>`|the head (page metadata)|`<h1>, <h2>, ...`|header(s)|\n", "|`<body>`|the body|`<img>`|images|\n", "|`<div>`|a logical division of the document|`<a>`|anchor (hyperlink)|\n", "|`<span>`|an *in-line* logical division|[MANY MORE](https://en.wikipedia.org/wiki/HTML_element)||\n", "\n", "For data collection, `div` tags are particularly important: when collecting data from a set of web pages, first selecting the subtree of the `div` that encloses the data yields clearer, less error-prone parsing code." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parsing HTML with Beautiful Soup\n", "\n", "The Python library Beautiful Soup 4, or `bs4`, parses strings or file-like objects representing HTML. The constructor `BeautifulSoup(page)` returns a `BeautifulSoup` object representing a *parsed document* as a tree structure.\n", "\n", "**Example:** The simple HTML document defined in the string below serves to illustrate the attributes and methods of the `BeautifulSoup` class:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "s = '''\n", "<html>\n", "<body>\n", "
<div id=\"content\">\n", "<h1>Heading here</h1>\n", "<p>My First paragraph</p>\n", "<p>My <em>second</em> paragraph</p>\n", "<hr>\n", "</div>\n", "<div id=\"nav\">\n", "<ul>\n", "<li>item 1</li>\n", "<li>item 2</li>\n", "<li>item 3</li>\n", "</ul>\n", "</div>\n", "</body>\n", "</html>\n", "'''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The HTML can be rendered in notebooks using the `IPython.display` module:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "<html>\n", "<body>\n", "
<div id=\"content\">\n", "<h1>Heading here</h1>\n", "<p>My First paragraph</p>\n", "<p>My <em>second</em> paragraph</p>\n", "<hr>\n", "</div>\n", "<div id=\"nav\">\n", "<ul>\n", "<li>item 1</li>\n", "<li>item 2</li>\n", "<li>item 3</li>\n", "</ul>\n", "</div>\n", "</body>\n", "</html>\n" ], "text/plain": [ "<IPython.core.display.HTML object>" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from IPython.display import HTML\n", "HTML(s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After parsing the HTML string into a document with BeautifulSoup, the resulting object can be explored:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bs4.BeautifulSoup" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import bs4\n", "# name the parser explicitly so bs4 does not have to guess one\n", "soup = bs4.BeautifulSoup(s, 'html.parser')\n", "type(soup)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The elements of a document can be retrieved by passing the desired tag name to the `find` method (first match) or the `find_all` method (all matches). For example, to retrieve all list items in the document:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[
<li>item 1</li>, <li>item 2</li>, <li>item 3</li>]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_items = soup.find_all('li')\n", "list_items" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each item in the resulting list is a document element (a `Tag` object); the text displayed in an element may be retrieved with its `text` attribute:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "bs4.element.Tag" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(list_items[0])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'item 1'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_items[0].text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select the top portion of the page, which holds the *content*, select the `div` with `find`, using the `attrs` keyword to specify the `id` of the desired `div` element:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "
<div id=\"content\">\n", "<h1>Heading here</h1>\n", "<p>My First paragraph</p>\n", "<p>My <em>second</em> paragraph</p>\n", "<hr/>\n", "</div>" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "content = soup.find('div', attrs={'id': 'content'})\n", "content" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `find` and `find_all` methods also support queries based on the text a document contains. For example, to retrieve every text node that contains the text 'paragraph', pass a predicate function that tests for the substring 'paragraph'. The predicate is applied to individual text nodes, so a paragraph whose text is split by an in-line tag yields only the matching fragment:" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['My First paragraph', ' paragraph']" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "soup.find_all(string=lambda x: 'paragraph' in x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The entirety of any document tree may be traversed, depth-first, with the `descendants` attribute, a generator that yields every tag and text node in the tree." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for elt in soup.descendants:\n", "    # each item is either a Tag or a NavigableString (text node)\n", "    if isinstance(elt, bs4.Tag):\n", "        print(elt.name)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }