Hack 44. Scrape Web Pages for Data

Use regular expressions to scrape data from sources like Metacritic.

What do you do when you want the data from a site, but the site won't let you export that data in a predictable format (like XML [Hack #38] or CSV [Hack #43])? One popular option is to perform what's called a screen scrape on the HTML to extract the data. Screen scraping starts with downloading the contents of the page containing the data into either a string in memory or a file. Regular expressions are then used to extract the relevant data from the string or file.

You can scrape almost any web site for data; for the example in this hack, I chose the Metacritic DVD review page (http://www.metacritic.com/video/).

Metacritic is a site where movies, music, and video games are given a review score based on a selection of reviews. Figure 5-10 shows the Metacritic page that I scraped for this hack. On the lefthand side of the window is a list of movies ordered by name, along with their review scores.

I can tell from the size of the page that I want only a small portion of the HTML. I use View Source to see what the code looks like, and indeed there is a section for these scores well defined by a div tag that contains what I'm looking for:

	</TR>
	</TABLE>
	  <DIV ID="sortbyname1">
	  <P CLASS="listing">
	  <SPAN CLASS="yellow">51</SPAN>
		  <A HREF="/video/titles/800bullets">800 Bullets</A><BR>
	  <SPAN CLASS="yellow">58</SPAN>
		  <A HREF="/video/titles/actsofworship">Acts of Worship</A><BR>
	  <SPAN CLASS="green">81</SPAN>
		<A HREF="/video/titles/badeducation"><B>Bad Education</B></A><IMG
	  SRC="/_/images/010/scores/star.gif" WIDTH="11" HEIGHT="11" ALIGN="absmiddle"><BR>
	  …

The first step will be to extract just this div tag. Then we need to use another regular expression to pick out each movie entry from text within the div tag. Notice that each movie listing starts with a span tag and ends with a br tag; that's good enough to delineate each movie. The third listing has some extra stuff around the movie title that I strip out with another set of regular expressions.

Figure 5-10. The Metacritic DVD and Video Review page

I strongly recommend using a divide-and-conquer technique when writing screen-scraping code. Don't try to do all of the work with a single regular expression, or you'll end up with indecipherable code that even you can't maintain.

5.12.1. The Code

Save the code in Example 5-33 as scrapecritic.php.

Example 5-33. PHP for loading a URL and scraping content from it

<html>
<?php
// Set up the CURL object
$ch = curl_init( "http://www.metacritic.com/video/" );

// Fake out the User Agent
curl_setopt( $ch, CURLOPT_USERAGENT, "Internet Explorer" );

// Start the output buffering
ob_start();

// Get the HTML from MetaCritic
curl_exec( $ch );
curl_close( $ch );

// Get the contents of the output buffer
$str = ob_get_contents();
ob_end_clean();

// Get just the list sorted by name
preg_match( "/\<DIV ID=\"sortbyname1\"\>(.*?)\<\/DIV\>/is",
		$str, $byname );

// Get each of the movie entries
preg_match_all( "/\<SPAN.*?>(.*?)\<\/SPAN\>.*?\<A.*?\>(.*?)\<BR\>/is",
		$byname[0], $moviedata );

// Work through the raw movie data
$movies = array();
for( $i = 0; $i < count( $moviedata[1] ); $i++ )
{
		// The score is ok already
		$score = $moviedata[1][$i];
			
		// We need to remove tags from the title and decode
		// the HTML entities
		$title = $moviedata[2][$i];
		$title = preg_replace( "/<.*?>/s", "", $title );
		$title = html_entity_decode( $title );
			
		// Then add the movie to the array
		$movies []= array( $score, $title );
}
?>
<body>
<table>
<tr>
<th>Name</th><th>Score</th>
</tr>
<?php foreach( $movies as $movie ) { ?>
<tr>
<td><?php echo( htmlspecialchars( $movie[1] ) ) ?></td>
<td><?php echo( htmlspecialchars( $movie[0] ) ) ?></td>
</tr>
<?php } ?>
</table>
</body>
</html>

The scrapecritic.php script starts by downloading the current contents of the Metacritic DVD page into a string. By default, curl_exec() writes the response straight to the page output, so the script wraps the call in ob_start(), ob_get_contents(), and ob_end_clean() to capture that text in a string instead.
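As an alternative to output buffering, cURL can hand the response back directly when CURLOPT_RETURNTRANSFER is set. The following is a minimal sketch rather than the code used in this hack; the function name fetch_page() and the timeout values are illustrative assumptions.

```php
<?php
// Hypothetical helper: fetch a URL and return its body as a string,
// or null on failure. The name and the timeout values are assumptions.
function fetch_page( $url )
{
    $ch = curl_init( $url );
    if ( $ch === false ) {
        return null;
    }

    // Return the response from curl_exec() instead of printing it
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_USERAGENT, "Internet Explorer" );

    // Give up rather than hang if the site is slow to answer
    curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, 5 );
    curl_setopt( $ch, CURLOPT_TIMEOUT, 15 );

    // curl_exec() returns false on failure when RETURNTRANSFER is set
    $body = curl_exec( $ch );

    return ( $body === false ) ? null : $body;
}
```

With a helper like this, the download step reduces to a single call such as $str = fetch_page( "http://www.metacritic.com/video/" ), and a null result signals that the fetch failed.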

The next step is to grab just the div tag that corresponds to the list sorted by name, using a preg_match() with a regular expression customized to this particular page. This is a clear demonstration of the primary technical problem with screen scraping: if the site being scraped changes its formatting in even the slightest way, it can (and probably will) break the scraping code. It's always better to get an XML feed for the data if that's possible. XML is far more resilient to changes in format.
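Because the pattern is tied to the page's current markup, it helps to detect a failed match explicitly instead of passing an empty result further down the script. A minimal sketch, assuming a hypothetical helper named extract_div() and a made-up sample page:

```php
<?php
// Hypothetical helper: return the contents of <DIV ID="$id">...</DIV>,
// or null when the page no longer contains that block.
// Note: the lazy match stops at the first </DIV>, so nested divs
// inside the block are not handled.
function extract_div( $html, $id )
{
    $pattern = '/<DIV ID="' . preg_quote( $id, '/' ) . '">(.*?)<\/DIV>/is';
    if ( preg_match( $pattern, $html, $found ) !== 1 ) {
        return null;
    }
    return $found[1];
}

// Made-up sample page for illustration
$page = '<TABLE></TABLE><DIV ID="sortbyname1"><P>listing</P></DIV>';

$list = extract_div( $page, 'sortbyname1' );
if ( $list === null ) {
    // The site layout changed; report it rather than show an empty table
    error_log( 'sortbyname1 block not found' );
}
```

Checking the return value of preg_match() this way turns a silent formatting change on the remote site into an error you can see in the log.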

With the name-sorted list in hand, the script then uses preg_match_all() to extract all of the movie names and scores into an array. The final step is to take this array of movies and strip the movie name of any extraneous tags or formatting.
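The extraction and cleanup steps can be tried in isolation on the markup shown earlier in this hack. The sketch below is one possible variation, not the script itself: it uses the PREG_SET_ORDER flag so that each match arrives as one array per movie, and it adds the s modifier to the tag-stripping pattern so that a tag broken across lines (as the IMG tag is in the listing) is still removed.

```php
<?php
// Sample markup copied from the listing shown earlier in this hack
$byname = <<<HTML
<DIV ID="sortbyname1">
<P CLASS="listing">
<SPAN CLASS="yellow">51</SPAN>
  <A HREF="/video/titles/800bullets">800 Bullets</A><BR>
<SPAN CLASS="green">81</SPAN>
  <A HREF="/video/titles/badeducation"><B>Bad Education</B></A><IMG
SRC="/_/images/010/scores/star.gif" WIDTH="11" HEIGHT="11" ALIGN="absmiddle"><BR>
</DIV>
HTML;

// PREG_SET_ORDER groups the captures per movie instead of per subpattern
preg_match_all( '/<SPAN.*?>(.*?)<\/SPAN>.*?<A.*?>(.*?)<BR>/is',
    $byname, $entries, PREG_SET_ORDER );

$movies = array();
foreach ( $entries as $entry ) {
    // Strip leftover tags (even multi-line ones), decode entities, trim
    $title = preg_replace( '/<.*?>/s', '', $entry[2] );
    $title = trim( html_entity_decode( $title ) );

    // Store the pair as array( score, title ), as the script does
    $movies[] = array( $entry[1], $title );
}

print_r( $movies );
```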

At this point, the data is cleaned and ready to be presented. The script uses a simple foreach loop to create a table that shows the name of the movie and the aggregated review score.

5.12.2. Running the Hack

To run the hack, copy the file onto your PHP server and surf to it in your web browser. The result should look like Figure 5-11.

Another use for screen scraping is content type conversion. You can take what was an HTML page and turn it into a WML page for web-enabled phones, or an RSS feed for news aggregators.
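As a rough illustration of content type conversion, the scraped score and title pairs could be written out as an RSS 2.0 feed. This is a sketch under assumptions: the function name movies_to_rss(), the channel title, the link, and the description text are all placeholders, not part of the original hack.

```php
<?php
// Escape text for use inside an XML element
function xml_text( $text )
{
    return htmlspecialchars( $text, ENT_XML1 | ENT_QUOTES, 'UTF-8' );
}

// Hypothetical helper: turn array( score, title ) pairs into an RSS 2.0
// document. The channel title, link, and description are placeholders.
function movies_to_rss( $movies )
{
    $rss  = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    $rss .= "<rss version=\"2.0\">\n<channel>\n";
    $rss .= "<title>Scraped review scores</title>\n";
    $rss .= "<link>http://www.example.com/</link>\n";
    $rss .= "<description>Scores gathered by screen scraping</description>\n";

    foreach ( $movies as $movie ) {
        $rss .= "<item>\n";
        $rss .= "<title>" . xml_text( $movie[1] ) . "</title>\n";
        $rss .= "<description>Score: " . xml_text( $movie[0] )
              . "</description>\n";
        $rss .= "</item>\n";
    }

    $rss .= "</channel>\n</rss>\n";
    return $rss;
}

echo movies_to_rss( array( array( '51', '800 Bullets' ) ) );
```

When serving the feed from a web page, you would typically also send an XML content type header before the output.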

5.12.3. Problems with Screen Scraping

There are two major problems with screen scraping. The first is technical and the second is legal. On the technical side, screen scraping is inclined to break when the site being scraped changes its format. In addition, the scraping code for one site will likely not work on other sites because of formatting issues. Finally, screen scraping can be slow or even break when the target site is not responding to web requests in a timely manner.
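One way to soften the last of these problems is to cache the scraped result, so that a slow or unreachable site does not stall every page view. The sketch below assumes a hypothetical cached_scrape() helper, a cache file in the system temporary directory, and a caller-supplied function that does the actual scraping and returns null on failure; none of these names come from the original script.

```php
<?php
// Read a previously cached result; returns null if missing or unreadable
function read_cache( $cacheFile )
{
    if ( !is_file( $cacheFile ) ) {
        return null;
    }
    return json_decode( file_get_contents( $cacheFile ), true );
}

// Hypothetical helper: return cached data if it is younger than $maxAge
// seconds, otherwise call $scraper and store its result. Falls back to
// the stale copy when the scrape fails.
function cached_scrape( $cacheFile, $maxAge, $scraper )
{
    $fresh = is_file( $cacheFile )
        && ( time() - filemtime( $cacheFile ) ) < $maxAge;
    if ( $fresh ) {
        return read_cache( $cacheFile );
    }

    $data = call_user_func( $scraper );
    if ( $data !== null ) {
        file_put_contents( $cacheFile, json_encode( $data ), LOCK_EX );
        return $data;
    }

    // Scrape failed: serve the old copy if there is one
    return read_cache( $cacheFile );
}

// Stand-in scraper for illustration; a real one would fetch and parse
function demo_scraper()
{
    return array( array( '51', '800 Bullets' ) );
}

$cacheFile = sys_get_temp_dir() . '/scrapecritic-demo.cache';
$movies = cached_scrape( $cacheFile, 3600, 'demo_scraper' );
```

Serving a stale copy when the scrape fails is a design choice: for slowly changing data such as review scores, slightly old results are usually preferable to an empty page.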

Choosing judiciously which pages to scrape is also important. Look for pages that were generated by a web application, as opposed to written by hand. Handwritten pages tend to have inconsistent, almost random markup; application-generated pages usually follow a predictable format, which makes writing regular expressions to match them a lot easier.

Figure 5-11. The resulting screen-scraped page

Web application pages normally end with extensions such as .php, .jsp, .asp, or some similar variant. Handwritten pages usually have the .htm or .html extension.

On the legal side, you must always make sure that you have permission to use the data in this way before adding this functionality to your site. There's nothing worse than writing lots of screen-scraping code only to find out that the content you've scraped was obtained illegally and cannot be used.

5.12.4. See Also

"Spider Your Site" [Hack #84]
"Test Your Application with Robots" [Hack #83]