Hack 47. Search Microsoft Word Documents

Search the text in Microsoft Word documents by parsing WordML files.

A lot of valuable data is locked up in Microsoft Word documents. Resumes in particular are tempting targets for data-mining applications: job boards need code that parses Word documents and finds keywords or phrases to categorize job candidates. This hack demonstrates how to search Word documents saved as WordML for text strings.

5.15.1. The Code

Save the code shown in Example 5-37 as index.php.

Example 5-37. HTML that handles data uploads
<html>
<body>
<form enctype="multipart/form-data" action="search.php" method="post">
	<input type="hidden" name="MAX_FILE_SIZE" value="2000000" />
	WordML file: <input type="file" name="file" /><br/>
	<input type="submit" value="Upload" />
</form>
</body>
</html>

Save the code in Example 5-38 as search.php. This script reads the uploaded WordML file and lists every distinct word it contains.

Example 5-38. Script that handles searching
<html>
<body>
<?php
$wordlist = array();

$dom = new DOMDocument();
if ( !empty( $_FILES['file']['tmp_name'] ) )
{
	// Parse the uploaded WordML file and collect its text (t) nodes
	$dom->load( $_FILES['file']['tmp_name'] );
	$found = $dom->getElementsByTagName( "t" );

	foreach( $found as $element )
	{
		// split() was removed in PHP 7; explode() does the same job here
		$words = explode( ' ', $element->nodeValue );
		foreach( $words as $word )
		{
			// Strip commas and periods, then surrounding whitespace
			$word = preg_replace( '/[,.]/', '', $word );
			$word = trim( $word );
			if ( strlen( $word ) > 0 )
			{
				// Use the lowercased word as a key so duplicates collapse
				$word = strtolower( $word );
				$wordlist[ $word ] = 0;
			}
		}
	}
}
$words = array_keys( $wordlist );
sort( $words );

foreach( $words as $word ) {
?>
<?php echo( htmlspecialchars( (string) $word ) ); ?><br/>
<?php } ?>
</body>
</html>

The search.php script starts by opening the uploaded WordML file with the XML DOM objects. It then finds all of the t nodes, which are where the text of the document is stored. The text of each node is split on spaces into words; commas, periods, and surrounding whitespace are stripped from each word, and the word is lowercased. Each word becomes a key in a hash table called $wordlist, so duplicates collapse into a single entry. At the end of the script, the keys are sorted and written out.
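
The same steps can be sketched as a standalone function, which is easier to test than a page script. This is a minimal sketch rather than the hack's own code: extractWords() is a hypothetical helper name, the sample document is hand-written, and the lookup uses getElementsByTagNameNS() on the assumption that the file declares the Word 2003 WordML namespace.

```php
<?php
// Assumption: the document uses the Word 2003 WordML namespace.
const WORDML_NS = 'http://schemas.microsoft.com/office/word/2003/wordml';

// Hypothetical helper: return the sorted, unique, lowercased words
// found in the text (w:t) nodes of a WordML string.
function extractWords(string $wordml): array
{
    $dom = new DOMDocument();
    // loadXML() returns false when the markup is not well-formed XML.
    if (!@$dom->loadXML($wordml)) {
        return [];
    }

    $wordlist = [];
    foreach ($dom->getElementsByTagNameNS(WORDML_NS, 't') as $element) {
        // Split on runs of whitespace and drop empty pieces.
        $words = preg_split('/\s+/', $element->nodeValue, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            // Strip commas and periods, as the original script does.
            $word = strtolower(str_replace([',', '.'], '', $word));
            if ($word !== '') {
                $wordlist[$word] = true;
            }
        }
    }

    // PHP turns numeric string keys into integers, so cast them back.
    $words = array_map('strval', array_keys($wordlist));
    sort($words, SORT_STRING);
    return $words;
}

// A tiny hand-written WordML fragment for illustration.
$sample = <<<XML
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p><w:r><w:t>PHP developer, Boston.</w:t></w:r></w:p>
    <w:p><w:r><w:t>Senior PHP developer</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>
XML;

print_r(extractWords($sample));
```

Keeping the parsing in a function separates it from the upload handling, so the same code could also run over WordML files that are already on disk.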

5.15.2. Running the Hack

Write a simple Microsoft Word 2003 document and save it as a WordML file somewhere on your disk. Then upload index.php and search.php to your web server and navigate your browser to index.php. It should look like Figure 5-19.

Click on the Browse button and select the WordML file, then click on the Upload button. That sends the file to the search.php script, which uses the XML DOM to read it. The words found in the WordML file are sorted and listed on the HTML page, as shown in Figure 5-20.
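
Before search.php parses the upload, it is worth checking that the upload actually succeeded. The following is a minimal sketch, not part of the original hack: uploadProblem() is a hypothetical helper name, and the 2000000-byte default simply mirrors the MAX_FILE_SIZE value in the form.

```php
<?php
// Hypothetical helper: inspect one entry of $_FILES.
// Returns null when the entry looks usable, or a short message otherwise.
function uploadProblem(array $file, int $maxBytes = 2000000): ?string
{
    $error = $file['error'] ?? UPLOAD_ERR_NO_FILE;
    if ($error === UPLOAD_ERR_NO_FILE) {
        return 'No file was uploaded.';
    }
    if ($error === UPLOAD_ERR_INI_SIZE || $error === UPLOAD_ERR_FORM_SIZE) {
        return 'The file is too large.';
    }
    if ($error !== UPLOAD_ERR_OK) {
        return 'The upload failed.';
    }
    // MAX_FILE_SIZE comes from the client and can be altered,
    // so check the reported size again on the server.
    if (($file['size'] ?? 0) > $maxBytes) {
        return 'The file is too large.';
    }
    return null;
}

// In search.php this might guard the call to $dom->load():
// if (uploadProblem($_FILES['file'] ?? []) === null
//         && is_uploaded_file($_FILES['file']['tmp_name'])) { ... }
```

The function takes the array as a parameter instead of reading $_FILES directly, which makes it possible to exercise it with hand-built arrays.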

From here, you can look for specific words, or count the occurrences of certain words [Hack #24].
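
As one possible starting point, the sketch below counts word occurrences and checks for a list of keywords. It is an illustration under assumptions, not code from this hack or from [Hack #24]: countWords() and hasKeywords() are hypothetical names, and the input string stands in for the text gathered from the t nodes.

```php
<?php
// Hypothetical helper: map each distinct word in $text to its frequency.
function countWords(string $text): array
{
    // Lowercase, strip commas and periods, then split on whitespace.
    $clean = strtolower(str_replace([',', '.'], '', $text));
    $words = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($words);
    // Put the most frequent words first.
    arsort($counts);
    return $counts;
}

// Hypothetical helper: report which keywords appear at least once.
function hasKeywords(array $counts, array $keywords): array
{
    $result = [];
    foreach ($keywords as $keyword) {
        $result[$keyword] = isset($counts[strtolower($keyword)]);
    }
    return $result;
}

$counts = countWords('PHP developer, Boston. Senior PHP developer with MySQL.');
print_r($counts);
print_r(hasKeywords($counts, ['PHP', 'Perl']));
```

A job board could run a keyword list like this against each uploaded resume to decide which categories a candidate belongs in.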

Figure 5-19. The upload page

Figure 5-20. The words found in the uploaded document

WordML is only supported by Microsoft Word 2003 and later versions. It's not currently supported on the Macintosh, though I expect it will be in later versions. To support older versions of Microsoft Word, you might want to rewrite the hack code to parse RTF instead of WordML. Every recent version of Microsoft Word supports RTF.


5.15.3. See Also

