
Friday, April 10, 2009

Easy Screen Scraping in PHP with the Simple HTML DOM Library

Client-side developers have always had it easy: libraries such as jQuery and Prototype make finding elements on the page reliable and efficient. In PHP, regular expressions tend to get messy, DOM calls can be confusing and verbose, and the string functions often just aren’t enough. In this tutorial, I’ll show you the middle ground: the open source PHP Simple HTML DOM Parser library, which provides jQuery-grade awesomeness for easy screen scraping without messy regular expressions.

The Simple HTML DOM Parser is implemented as a single PHP class and a few helper functions. It supports CSS-selector-style element lookup (as in jQuery), can handle invalid HTML, and even provides a familiar interface for manipulating the DOM.

Here’s a sample of simplehtmldom in action:

$html = file_get_html('http://www.google.com/');

foreach ($html->find('a') as $element)
    echo $element->href . "\n";

This snippet is fairly self-explanatory. file_get_html() (called file_get_dom() in early releases of the library) is a simple helper function that fetches the page and constructs a new simplehtmldom object around it. Once the object is available, we can use simple CSS selectors to find our elements, in this case anchors, and iterate over them just as we would with PHP 5’s standard DOM classes. (The equivalent code with the standard DOM classes is twice as long.)
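For comparison, here’s a rough sketch of what the same link listing might look like with PHP’s built-in DOMDocument class. It’s only an illustration: the inline $markup string is a stand-in I made up so the example runs without a network call, and the href check is my own addition rather than anything the library does.

```php
<?php
// Stand-in markup (an assumption) so the sketch runs without fetching a page.
$markup = '<p><a href="https://example.com/a">A</a> '
        . '<a href="/b">B</a> <a name="top">no href</a></p>';

$doc = new DOMDocument();

// Real-world HTML is rarely valid, so collect libxml's parse warnings
// internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();

$hrefs = [];
foreach ($doc->getElementsByTagName('a') as $anchor) {
    // Skip anchors that have no href attribute.
    if ($anchor->hasAttribute('href')) {
        $hrefs[] = $anchor->getAttribute('href');
    }
}

echo implode("\n", $hrefs), "\n";
```

Even for a lookup by tag name, most of the code is setup and error suppression; anything beyond tag names would also need an XPath query in place of the CSS selector.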

But the library doesn’t stop there - as well as traversing the DOM and extracting information, you can also alter it. Consider this snippet:

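$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

// The second argument to find() is a zero-based index: 1 selects the second div.
$html->find('div', 1)->class = 'bar';
$html->find('div[id=hello]', 0)->innertext = 'foo';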

The library supports many DOM-style approaches to manipulation, from exposing element attributes as object properties, as shown here, to a few helper methods. It also includes methods for traversing from the current node, such as children(), parent() and first_child().

Real scraping? Easy. Here’s their Slashdot sample:

$html = file_get_html('http://slashdot.org/');

foreach ($html->find('div.article') as $article) {
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);
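One caveat with samples like this: site markup changes, and as far as I can tell find() with an index returns null when nothing matches, so chaining ->plaintext straight onto it can misfire. Here’s a minimal sketch of a guard. text_or_default() is a hypothetical helper of my own, not part of the library, and the two stand-in values just simulate a hit and a miss so the example runs without the library loaded.

```php
<?php
/**
 * Return the trimmed plain text of a node, or a default when the node is missing.
 * Hypothetical helper for illustration; not part of Simple HTML DOM.
 */
function text_or_default(?object $node, string $default = ''): string
{
    return $node === null ? $default : trim((string) $node->plaintext);
}

// Stand-ins (assumptions) for what $article->find('div.title', 0) and
// $article->find('div.intro', 0) might return: one match, one miss.
$found   = (object) ['plaintext' => '  Example headline  '];
$missing = null;

$item = [
    'title' => text_or_default($found),
    'intro' => text_or_default($missing, '(no intro)'),
];

print_r($item);
```

With the library loaded, you would pass the result of each find() call to the helper directly, so that a missing element yields a placeholder instead of a warning.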

And finally, there’s always a simple save mechanism:

$html->save('altered-dom.html');

Ready to get started? Head over to the project website, online documentation or the project page on SourceForge.
