Hack 47. Search Microsoft Word Documents
Search the text in Microsoft Word documents by parsing WordML files.

A lot of valuable data is locked up in Microsoft Word documents. Resumes in particular are tempting targets for data-mining applications: job boards need code that parses Word documents and finds keywords or phrases to categorize job candidates. This hack demonstrates how to search Word documents saved as WordML for text strings.

5.15.1. The Code

Save the code shown in Example 5-37 as index.php.

Example 5-37. HTML that handles data uploads

    <html>
    <body>
    <form enctype="multipart/form-data" action="search.php" method="post">
    WordML file:
    <input type="hidden" name="MAX_FILE_SIZE" value="2000000" />
    <input type="file" name="file" /><br/>
    <input type="submit" value="Upload" />
    </form>
    </body>
    </html>

Save the code in Example 5-38 as search.php. This script extracts the words from the uploaded WordML file and lists them.

Example 5-38. Script that handles searching

    <html>
    <body>
    <?php
    $wordlist = array();

    $dom = new DOMDocument();
    // Only parse when a file was actually uploaded
    if ( !empty( $_FILES['file']['tmp_name'] ) )
    {
      $dom->load( $_FILES['file']['tmp_name'] );

      // The t elements hold the text runs of the document
      $found = $dom->getElementsByTagName( "t" );
      foreach( $found as $element )
      {
        // Break the text run into words on spaces
        $words = explode( ' ', $element->nodeValue );
        foreach( $words as $word )
        {
          // Strip commas and periods, then leading and trailing whitespace
          $word = preg_replace( '/[,]|[.]/', '', $word );
          $word = preg_replace( '/^\s+/', '', $word );
          $word = preg_replace( '/\s+$/', '', $word );
          if ( strlen( $word ) > 0 )
          {
            // Use the lowercased word as a key so duplicates collapse
            $word = strtolower( $word );
            $wordlist[ $word ] = 0;
          }
        }
      }
    }

    $words = array_keys( $wordlist );
    sort( $words );

    foreach( $words as $word ) {
    ?>
    <?php echo( htmlspecialchars( $word ) ); ?><br/>
    <?php } ?>
    </body>
    </html>

The search.php script starts by taking the uploaded WordML file and opening it using the XML DOM objects. Then it finds all of the t nodes, which are where the text of the document is stored. It chops that text into words on spaces, strips commas, periods, and surrounding whitespace from each word, lowercases it, and stores it as a key in a hash table called $wordlist, so each distinct word appears only once.
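The word-extraction steps can also be sketched as a standalone function that counts occurrences instead of only recording that a word is present, which is closer to what a job board would need for keyword matching. This is a minimal sketch, not the hack's own code: the function name countWordmlWords, the WORDML_NS constant, the sample document, and the keyword list are all assumptions made for illustration. It also assumes the text runs use the Word 2003 WordML namespace.

```php
<?php
// Hypothetical helper, not part of the original hack: counts the words found
// in the <w:t> text runs of a WordML (Word 2003 XML) document given as a string.
const WORDML_NS = 'http://schemas.microsoft.com/office/word/2003/wordml';

function countWordmlWords(string $xml): array
{
    $dom = new DOMDocument();
    // Empty or malformed input yields no words; @ hides libxml parse warnings.
    if ($xml === '' || !@$dom->loadXML($xml)) {
        return [];
    }

    $counts = [];
    foreach ($dom->getElementsByTagNameNS(WORDML_NS, 't') as $element) {
        // Split on any run of whitespace and drop empty pieces.
        $words = preg_split('/\s+/', $element->nodeValue, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            // Strip commas and periods, as search.php does, then lowercase.
            $word = strtolower(str_replace([',', '.'], '', $word));
            if ($word !== '') {
                $counts[$word] = ($counts[$word] ?? 0) + 1;
            }
        }
    }
    ksort($counts);
    return $counts;
}

// Assumed sample input: a tiny hand-written WordML fragment.
$sample = '<?xml version="1.0"?>'
    . '<w:wordDocument xmlns:w="' . WORDML_NS . '"><w:body>'
    . '<w:p><w:r><w:t>PHP developer, senior.</w:t></w:r></w:p>'
    . '<w:p><w:r><w:t>Wrote PHP and SQL.</w:t></w:r></w:p>'
    . '</w:body></w:wordDocument>';

$counts = countWordmlWords($sample);

// Hypothetical keyword list a job board might look for.
foreach (['php', 'sql', 'java'] as $keyword) {
    echo $keyword, ': ', $counts[$keyword] ?? 0, "\n";
}
```

For the sample document above, the loop reports two occurrences of php, one of sql, and none of java. Passing the namespace explicitly keeps the lookup from matching unrelated elements that happen to be named t, at the cost of tying the sketch to the Word 2003 format.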
At the end of search.php, the sorted word list is written out to the page.

5.15.2. Running the Hack

Write a simple Microsoft Word 2003 document and save it as a WordML file somewhere on your disk. Then upload index.php and search.php to your web server and navigate your browser to index.php. The page should look like Figure 5-19.

Click the Browse button and select the WordML file, then click the Upload button. That sends the file to the search.php script, which uses the XML DOM to read it. The words in the WordML file are sorted and listed on the HTML page, as shown in Figure 5-20. From here, you can look for specific words or count the occurrences of certain words [Hack #24].

Figure 5-19. The upload page

Figure 5-20. The words found in the uploaded document

5.15.3. See Also