
The easiest and most familiar way to extract data from HTML web pages is to use "CSS selectors". These are the same selectors that web stylesheets use to pick out the elements whose spacing, colour and layout they describe.

For more details, read the Simple HTML DOM documentation, or the CSS selector specification.

Getting started

Grab the HTML web page, and parse the HTML using Simple HTML DOM.

require 'scraperwiki/simple_html_dom.php';
$html_content = scraperwiki::scrape("");
$html = str_get_html($html_content);

Select all <a> elements that are inside <div class="featured">. These queries are called CSS selectors. They use the same syntax as CSS stylesheets and jQuery, and are quite powerful.

foreach ($html->find("div.featured a") as $el) {
    print $el . "\n";
}
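The selector forms used most often can be sketched on a small inline fragment, so the results do not depend on any live page. The markup and variable names below are made up for illustration; only find() and str_get_html() come from Simple HTML DOM.

```php
<?php
require 'scraperwiki/simple_html_dom.php';

// Hypothetical markup, invented for this example only.
$html = str_get_html(
    '<div class="featured"><a href="/one">One</a> <a name="x">Anchor</a></div>' .
    '<div id="sidebar"><a href="/two">Two</a></div>'
);

// Descendant selector: every <a> inside <div class="featured">.
$featured = $html->find('div.featured a');

// Attribute selector: only <a> elements that carry an href attribute.
$linked = $html->find('a[href]');

// ID selector: the <a> elements inside <div id="sidebar">.
$sidebar = $html->find('div#sidebar a');

print count($featured) . " featured, ";
print count($linked) . " with href, ";
print count($sidebar) . " in sidebar\n";
```

Running each selector against a fragment you control is a quick way to check that it matches what you expect before pointing it at a real page.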

Read attributes, such as the target of the <a> tags (put this inside the "foreach" loop, before the "}").

print $el->href . "\n";
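As a minimal sketch of putting the two steps together, the loop below collects the text and target of each link into an array rather than printing them straight away. The inline markup and the $links array are assumptions made for the example.

```php
<?php
require 'scraperwiki/simple_html_dom.php';

// Hypothetical markup standing in for a scraped page.
$html = str_get_html(
    '<div class="featured"><a href="/one">One</a><a href="/two">Two</a></div>'
);

$links = array();
foreach ($html->find('div.featured a') as $el) {
    // Attributes are read as properties; plaintext is the visible link text.
    $links[] = array(
        'text' => $el->plaintext,
        'href' => $el->href,
    );
}

print json_encode($links) . "\n";
```

Collecting rows like this keeps the extraction separate from whatever you do with the data afterwards, such as saving it.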

Text extraction

Select the first <strong> element inside <div id="footer_inner">.

$el = $html->find("div#footer_inner strong", 0);
print $el . "\n";

Extract the contents of the tag. Note that innertext returns the inner HTML, so any child tags are kept.

print $el->innertext . "\n";

Get all text recursively, throwing away any child tags.

$eg = str_get_html('<h2>A thing <b>goes boom</b> up <i>on <em>the tree</em></i></h2>');
print $eg->plaintext . "\n"; // 'A thing goes boom up on the tree'
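To make the difference between the text properties concrete, here is a small sketch that prints three views of the same element. The fragment is a shortened version of the example above, and the variable name $h2 is chosen for illustration.

```php
<?php
require 'scraperwiki/simple_html_dom.php';

$eg = str_get_html('<h2>A thing <b>goes boom</b></h2>');
$h2 = $eg->find('h2', 0);

// The element including its own tags.
print $h2->outertext . "\n"; // <h2>A thing <b>goes boom</b></h2>

// The contents of the element, child tags included.
print $h2->innertext . "\n"; // A thing <b>goes boom</b>

// The text only, with all tags thrown away.
print $h2->plaintext . "\n"; // A thing goes boom
```

In practice plaintext is usually the one to store, while innertext is useful when the markup inside the element still needs to be parsed.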

Finding data manually

Iterate down through the elements in the document and see the tags and attributes on each element.

$html_el = $html->find("html", 0);
foreach ($html_el->children() as $child1) {
    print $child1->tag . "\n";
    foreach ($child1->children() as $child2) {
        print "-- " . $child2->tag . " ";
        print json_encode($child2->attr) . "\n";
    }
}
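The two nested loops above stop at the second level. One possible way to go to any depth is a small recursive helper; tree_outline() below is a hypothetical function written for this sketch, not part of Simple HTML DOM, and the markup is invented.

```php
<?php
require 'scraperwiki/simple_html_dom.php';

// Hypothetical helper: returns an indented outline of the tags below $node.
function tree_outline($node, $depth = 0) {
    $out = str_repeat('  ', $depth) . $node->tag . "\n";
    foreach ($node->children() as $child) {
        $out .= tree_outline($child, $depth + 1);
    }
    return $out;
}

// Made-up fragment for illustration.
$eg = str_get_html('<ul><li><a href="#">x</a></li><li>y</li></ul>');
$outline = tree_outline($eg->find('ul', 0));
print $outline;
```

An outline like this is a convenient first look at an unfamiliar page before deciding which selectors to write.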

Navigate around the document.

$eg = str_get_html('<h2>A thing <b>goes boom</b> up <i>on <em>the tree</em></i></h2>');
print $eg->root->first_child()->tag . "\n"; # h2
print $eg->root->first_child()->children(0)->tag . "\n"; # b
print $eg->root->first_child()->children(0)->next_sibling()->tag . "\n"; # i
print $eg->root->first_child()->children(1)->tag . "\n"; # i
print $eg->root->first_child()->children(1)->parent()->tag . "\n"; # h2

Running out of memory

If your script is running out of memory, you can explicitly tell each DOM object you made to clean itself up. See, for example, this scraper.

$html->__destruct();
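When many pages are parsed in one run, the cleanup is typically done once per page, inside the loop. The sketch below assumes the clear() method of the Simple HTML DOM object, which is the explicit call for releasing its memory; the $pages array is a stand-in for content that would really come from scraperwiki::scrape().

```php
<?php
require 'scraperwiki/simple_html_dom.php';

// Stand-ins for scraped pages; real content would come from scraperwiki::scrape().
$pages = array('<h1>first</h1>', '<h1>second</h1>');

$titles = array();
foreach ($pages as $content) {
    $dom = str_get_html($content);
    $titles[] = $dom->find('h1', 0)->plaintext;

    // Release the parsed tree before moving on to the next page.
    $dom->clear();
    unset($dom);
}

print implode(", ", $titles) . "\n";
```

Copy any values you need out of the DOM object before cleaning it up, since its elements are no longer usable afterwards.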