This post is a report on my experience running Windows ports of the legacy *NIX file(1) command.

The first stage of processing my roll-your-own terabyte text corpus (after acquiring it from the internet, of course!) is to classify each document in the corpus according to its type.

I require classification of files from the text corpus by MIME type: e.g. ‘application/pdf’, ‘text/html’, ‘image/jpeg’, etc.  The document type classification drives the application of text extraction algorithms in further stages of processing.  Accurate knowledge of document type is important to avoid database corruption caused by applying the wrong text extraction algorithms to a document.  (I do not require 100% accuracy here, since a little corruption in the corpus is inevitable, but the higher the accuracy, the better.)

Document type is asserted during collection and analysis of the corpus in four relevant ways: (1) the HTTP MIME type assertion; (2) the document filename extension; (3) the output of a document type classification program; (4) context (e.g. the type of link or feed, or surrounding HTML markup).  Since the document types reported by HTTP headers, filename extensions, and search engine results are not always accurate, I require agreement between all applicable type assertions before I allow a document into the corpus.
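The agreement rule can be sketched in a few lines of Python.  This is only an illustration, not the pipeline's actual code: the function name `agreed_type` and the source labels ("http", "extension", "classifier", "context") are hypothetical, and a source that makes no assertion about a document is represented as None.

```python
from typing import Dict, Optional


def agreed_type(assertions: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the MIME type if every applicable assertion agrees, else None.

    `assertions` maps a source label (hypothetical names such as "http",
    "extension", "classifier", "context") to the MIME type that source
    asserts, or to None when the source does not apply to this document.
    """
    # Normalise each asserted type: drop parameters such as "; charset=...",
    # trim whitespace, and compare case-insensitively.
    applicable = {
        value.split(";")[0].strip().lower()
        for value in assertions.values()
        if value
    }
    # Exactly one distinct type means all applicable sources agree.
    if len(applicable) == 1:
        return next(iter(applicable))
    return None
```

A document whose HTTP header says ‘text/html’ while its extension implies ‘application/pdf’ yields None and would be kept out of the corpus.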

This post relates to method #3 for detecting document type — using a pre-existing classification program. 

I tried two Windows ports of the *NIX ‘file’ command.  One port ran, the other did not.  I have had success with the Optima SC port and have incorporated it into the document processing pipeline.

Attempt #1 (Failure)

Download file utility installer (“Complete package, except sources”) from http://gnuwin32.sourceforge.net/packages/file.htm.

Run the installer.  By default, the executables and necessary .dll files are installed to C:/program files/GnuWin32/bin/.  The ‘magic.mgc’ database file is installed to C:/program files/GnuWin32/share/file/magic.mgc.

Either add the install bin/ directory to your path, or copy the installed files to a directory your path already points to.  Add an environment variable MAGIC with the value c:/program files/GnuWin32/share/file.  (mutatis mutandis)

Run the file utility on a collection of files.  Result: the program spits out thousands of lines of warnings.  Conclusion: Does not work!

Attempt #2 (Success)

Download file utility from http://www.optimasc.com/products/fileid/file0.6.1-win32.zip.  Download type identification database from http://magicdb.org/magic.db.

Either add the install bin/ directory to your path, or copy the executable ‘file.exe’ and the identification database ‘magic.db’ to a directory in your search path.

Run the file utility.  Result: Default arguments work!

Further tests:
C:\> file --check-brief *
C:\> [GOOD OUTPUT!]

C:\> file --check-standard *
C:\> [GOOD OUTPUT!]

C:\> file --check-harder *
C:\> [CRASH!]

The program seems to work OK in the ‘brief’ and ‘standard’ modes.  It crashes in ‘harder’ mode, and it also crashes when asked to run on a directory containing 50,000 files.  The output is useful (though it must be parsed), and the program is reliable when run on a single file at a time.  It is possible to produce output in UTF-8.
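Since the program is reliable on one file at a time, a driver script can invoke it once per file and treat a crash as "no answer" for that file only.  The following is a rough Python sketch under several assumptions: the executable is reachable as `file` on the search path, it is called with default arguments, it exits with status 0 on success, and its output is UTF-8.  The function names are hypothetical, and the raw output line is returned unparsed because the exact output format is not described here.

```python
import subprocess
from pathlib import Path
from typing import Dict, Optional, Sequence


def classify_one(path: Path, command: Sequence[str] = ("file",),
                 timeout_s: float = 30.0) -> Optional[str]:
    """Run the classifier on a single file.

    Returns the raw output (still to be parsed), or None if the program
    could not be started, timed out, or exited with a non-zero status.
    """
    try:
        result = subprocess.run(
            [*command, str(path)],
            capture_output=True,
            text=True,
            encoding="utf-8",   # assumption: the tool is producing UTF-8
            errors="replace",
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


def classify_tree(root: Path,
                  command: Sequence[str] = ("file",)) -> Dict[Path, Optional[str]]:
    """Classify every regular file under `root`, one process per file."""
    return {
        path: classify_one(path, command)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

One process per file is slow for a large corpus, but a crash then costs a single classification rather than the whole batch.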

In a test I ran using approximately 100,000 documents downloaded from the internet, the initial type prediction based on context and filename extension was verified by the http content-type assertion and the output of the optimasc ‘file’ document type classification program 92% of the time.

In a test of 100 known pdf files, the optimasc file program correctly classified all 100 files as application/pdf.  In a test of 100 known well-formed html files, the optimasc file program correctly classified all 100 files as text/html.

The ‘file’ utility under Windows appears to be due to Carl Eric Codere of Optima SC Inc.  (http://www.optimasc.com/)  I appreciate Carl Eric’s efforts, and the decision by Optima SC to make the utility available.

 

Intro to the Feed Manager Application

There are plenty of feed readers, but there don’t seem to be any feed scrapers readily available.  By a ‘feed scraper’ I mean a program that will download and store the actual text of feed items gathered from rss (or atom) feeds.

News Feed Scraper

I am releasing my news feed scraper.  The development environment for the release is C++ under Microsoft Visual Studio 2005 using Qt Commercial Edition from Trolltech.  If you want to just run the executables you do not need Visual Studio or Qt.  The executables have been tested on Vista and XP.

The release consists of three downloads:

  • A set of development libraries here;
  • A generic template for creating plugins here;
  • A feed server and feed client here.

In order to run the software you will need to:

  • Download the feed server and client;
  • Unzip the download.  This will create a FeedManager and a FeedServer directory;
  • Run FeedClient/Executable/FeedClient.exe;
  • Run FeedManager/Executable/FeedManager.exe;
  • Drag the url from a feed into the FeedManager…

In order to compile the software you will need to:

  • Download the three downloads;
  • Follow the directions here to install the development libraries (in the course of which you will define an environment variable $(DIR4APPS), which contains the name of your preferred installation directory);
  • Install the plugin solution in the $(DIR4APPS) directory.  This should create a $(DIR4APPS)/ModifyUrlPlugins/ directory;
  • Install the FeedManager solution in the $(DIR4APPS) directory;
  • You should be able to load the solution file $(DIR4APPS)/FeedManager/FeedManager.sln.

The ModifyUrlPlugins directory illustrates how to write .dll plugins under Qt that can be explicitly linked at run-time with a statically-linked executable.  See here for a write-up on this.

Function of the Feed Manager Application

The news feed scraper consists of two separate applications: a Feed Client, and a Feed Server.

The Feed Server maintains a database containing the URLs of a series of rss feeds.  Periodically, the Feed Server checks to see if there are any new articles mentioned in the rss feeds, by downloading a new version of the stalest feed in its database.  When the server encounters a new article mentioned in a new version of a feed, it puts the URL of the article into a table in its database.
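The stalest-feed-first policy can be illustrated with a small in-memory model.  This Python sketch is not the released C++ code: the class and method names are hypothetical, and plain dictionaries and lists stand in for the server's database tables.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class Feed:
    url: str
    last_fetched: float = 0.0  # epoch seconds; 0.0 means never fetched


class FeedTable:
    """In-memory stand-in for the server's feed and article tables."""

    def __init__(self) -> None:
        self.feeds: Dict[str, Feed] = {}
        self.known_articles: Set[str] = set()
        self.pending_articles: List[str] = []  # URLs waiting for a client

    def add_feed(self, url: str) -> None:
        self.feeds.setdefault(url, Feed(url))

    def stalest(self) -> Optional[Feed]:
        """Return the feed fetched longest ago, or None if there are no feeds."""
        if not self.feeds:
            return None
        return min(self.feeds.values(), key=lambda feed: feed.last_fetched)

    def record_fetch(self, feed_url: str, article_urls: Iterable[str],
                     now: float) -> List[str]:
        """Mark a feed as freshly fetched and queue article URLs not seen before."""
        self.feeds[feed_url].last_fetched = now
        new_urls = []
        for url in article_urls:
            if url not in self.known_articles:
                self.known_articles.add(url)
                new_urls.append(url)
        self.pending_articles.extend(new_urls)
        return new_urls
```

The server loop would repeatedly call `stalest()`, download that feed, and pass the item URLs it finds to `record_fetch()`.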

The Feed Client periodically checks with the Feed Server to see if it knows of any new articles.  If it does, the Feed Client downloads the article and stores it, uncompressed, in a directory called Articles.  If the article is unavailable, for any reason, it is skipped and not retried.

The Feed Server and Feed Client communicate with one another via the UDP protocol.  They can be run on the same machine or on different machines visible to one another on the same network.  A maximum of 3 Feed Clients can communicate with one Feed Server.  The Feed Server downloads feeds; the Feed Client downloads items (articles) mentioned in those feeds.
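To give a feel for this kind of exchange, here is a minimal Python sketch of a request and reply over UDP.  The message format (`NEXT`, `URL ...`, `NONE`) is invented for illustration and is not the protocol the released applications use; the function names are likewise hypothetical, and the three-client limit is not modelled.

```python
import socket
from typing import List, Optional, Tuple


def serve_once(sock: socket.socket, pending: List[str]) -> None:
    """Answer one client datagram with the next pending article URL, if any."""
    request, client_addr = sock.recvfrom(2048)
    if request == b"NEXT" and pending:
        reply = b"URL " + pending.pop(0).encode("utf-8")
    else:
        reply = b"NONE"
    sock.sendto(reply, client_addr)


def ask_for_article(server_addr: Tuple[str, int],
                    timeout_s: float = 2.0) -> Optional[str]:
    """Ask the server for one article URL; None if it has none or does not reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        sock.sendto(b"NEXT", server_addr)
        try:
            reply, _ = sock.recvfrom(2048)
        except socket.timeout:
            # UDP gives no delivery guarantee, so a lost datagram looks
            # the same as a server with nothing to offer.
            return None
    if reply.startswith(b"URL "):
        return reply[4:].decode("utf-8")
    return None
```

Because UDP is connectionless, the client simply asks again on its next periodic check if a request or reply goes missing.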

Screen Shots

The Feed Server

Here’s a screen shot of the Feed Server.  The application has two modes: server mode, and database mode.  In server mode the application displays the articles being served and their status.  In database mode the application allows you to edit the database of feeds the application is managing.  The application continues to download feeds and manage communication with the Feed Client in the background while in database mode.

feedserver.jpg

 

The Feed Client

Here’s a screen shot of the Feed Client.  The application has only one mode: client mode.  In client mode the application displays the articles being downloaded and their status.

feedclient.jpg

Status of the Application

I have used the application to download about a gigabyte (so far) of text from news feeds.  The application is a reasonably well-behaved net citizen: it has a substantial delay between repeat visits to the same web-site, for instance.  However, it does not use the HTTP ‘If-Modified-Since’ feature.  It probably should.
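For reference, a conditional request looks roughly like the following Python sketch, which uses only the standard library.  It is not part of the released application; the function names are hypothetical, and it assumes the time of the previous successful fetch is stored as epoch seconds.

```python
import urllib.error
import urllib.request
from email.utils import formatdate
from typing import Optional


def conditional_request(url: str, last_fetched: float) -> urllib.request.Request:
    """Build a GET that asks the server to reply only if the feed has changed."""
    # The header value must be an HTTP date in GMT.
    stamp = formatdate(last_fetched, usegmt=True)
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})


def fetch_if_modified(url: str, last_fetched: float,
                      timeout_s: float = 30.0) -> Optional[bytes]:
    """Return the new body, or None when the server answers 304 Not Modified."""
    request = conditional_request(url, last_fetched)
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # unchanged since last_fetched; nothing to download
        raise
```

A server that honours the header answers an unchanged feed with an empty 304 response, which saves bandwidth on both ends.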

The application runs pretty smoothly for an application that hasn’t been banged on too much, but I’m sure there are still a few bugs in there somewhere.  Never forget the General Rule of Debugging:

  1. Bugs occur with all frequencies;
  2. The time it takes to identify and fix a bug is proportional to the inverse of its frequency.

Therefore, if you’ve only banged on an application for a few weeks, you are going to miss the bugs that only occur on a monthly or yearly schedule.  And just because you haven’t seen such infrequent bugs doesn’t mean they aren’t there.

 

The Linguistic Data Consortium makes a number of linguistic corpora available for research and development purposes.  They are generally tagged by part-of-speech, which is nice; but for my purposes the corpora are generally overpriced and undersized. 

The largest corpus I can reasonably handle is about 1 terabyte — representing about 100 billion words in 1 billion sentences.  I am willing to allocate 100 days to acquiring the corpus.  How can I get a terabyte of text for cheap in 100 days?

Data Sources

  • Download the Gutenberg corpus.
  • Download WordNet.  WordNet is a dictionary database that will be fantastically useful for boot-strapping semantic analysis.
  • Use rss feeds from several of the news organizations.
  • Scrape text from the web.  My impression is that pdf files will offer the highest-quality source of text, for the following reasons:
    1. Pdf is the format of choice for longer and more formal documents;
    2. Short pdf’s tend to be advertising brochures, but longer pdf’s tend to be advertising-free;
    3. A number of programs already exist for extracting text from pdf files.  Thus, if we focus on pdf files alone the parsing job is simplified.

Data Volumes

The Gutenberg corpus contains roughly 16 gigabytes of text.  Probably 10 gigabytes of that is usable, non-duplicated English. 

WordNet is relatively small, but powerful.

It is difficult (a priori) to estimate the amount of text I can get from news sources, but it might be possible to get ~10 MB per day of text spanning a large number of different topic areas.  In 100 days, this could accumulate to 1 gigabyte.  If I added blog sources I could get more text, but the advantage in using news sources is that the text tends to follow a predictable format.  The parsing problem for blogs would be overwhelming.

Unique amongst all the search engines, Alexa Web Search is willing to sell you the complete results of a web search query.  With Alexa, I can request, and they will deliver, a list of all the pdf files on the web.  I registered as a user and requested a list of all URLs of all PDFs larger than 128 kBytes.  I got a list of about 1/2 million URLs.  This represents a text corpus of about 0.6 terabyte.  It cost me a few dollars.

Method of Acquisition

My internet connection averages about 100 megabytes per hour downloading.  Thus, I can download the Gutenberg corpus in about a week (16 gigabytes at 100 megabytes per hour is roughly 160 hours).  WordNet downloads in less than an hour.

100 megabytes per hour is also plenty sufficient to keep up with the rate of news production.

However, at 100 megabytes per hour it would take about 8 months to download the 0.6 terabytes in my pdf database (6,000 hours, or 250 days).  For this, I will need a bigger pipe.  I will need to rent space in a server farm.  At 2 or 3 megabits per second (a typical server farm rate), you can get 600 gigabytes in under a month.
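The download-time arithmetic is easy to check.  The short Python sketch below uses decimal units (1 gigabyte = 1,000 megabytes), the figures quoted in this post, and hypothetical function names; it assumes the link runs at its stated rate around the clock.

```python
def days_to_download(total_mb: float, mb_per_hour: float) -> float:
    """Days needed to move total_mb at a steady mb_per_hour, running 24 hours a day."""
    return total_mb / mb_per_hour / 24


def mbit_per_s_to_mb_per_hour(mbit_per_s: float) -> float:
    """Convert a line rate in megabits per second to megabytes per hour."""
    return mbit_per_s / 8 * 3600


HOME_RATE = 100.0                 # megabytes per hour
GUTENBERG_MB = 16 * 1000          # about 16 gigabytes
PDF_MB = 600 * 1000               # about 0.6 terabytes

gutenberg_days = days_to_download(GUTENBERG_MB, HOME_RATE)      # just under 7 days
pdf_days_home = days_to_download(PDF_MB, HOME_RATE)             # 250 days
pdf_days_2mbit = days_to_download(PDF_MB, mbit_per_s_to_mb_per_hour(2))  # about 28 days
pdf_days_3mbit = days_to_download(PDF_MB, mbit_per_s_to_mb_per_hour(3))  # about 19 days
news_mb = 10 * 100                # 10 MB per day for 100 days: 1,000 MB, i.e. 1 gigabyte
```

Real transfers will take longer than these figures, since they ignore protocol overhead, slow servers, and the delays between repeat visits to the same site.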