This post is a report on my experience running Windows ports of the legacy *NIX file(1) command.

The first stage of processing my roll-your-own terabyte text corpus (after acquiring it from the internet, of course!) is to classify each document in the corpus according to its type.

I need each file in the text corpus classified by its MIME type: e.g. ‘application/pdf’, ‘text/html’, ‘image/jpeg’, etc.  The document type classification determines which text extraction algorithms are applied in later stages of processing.  Accurate knowledge of document type is important to avoid database corruption caused by applying the wrong text extraction algorithm to a document.  (I do not require 100% accuracy here, since a little corruption in the corpus is inevitable, but the higher the accuracy, the better.)

Document type is asserted during collection and analysis of the corpus in four relevant ways: (1) the HTTP MIME type assertion; (2) the document filename extension; (3) the output of a document type classification program; (4) context (e.g. the type of link or feed, or surrounding HTML markup).  Since the document types reported by HTTP headers, filename extensions, and search engine results are not always accurate, I require agreement between all applicable type assertions before I allow a document into the corpus.
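The agreement rule described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's actual code: the function name agreed_type, and the convention that None (or an empty string) marks an assertion that does not apply, are assumptions made for the example.

```python
from typing import Optional


def agreed_type(*assertions: Optional[str]) -> Optional[str]:
    """Return the MIME type if every applicable assertion agrees, else None.

    Each argument is one type assertion (HTTP header, filename extension,
    classifier output, context). None or "" means "not applicable"
    (an assumption of this sketch) and is ignored.
    """
    # Normalise case and drop parameters such as "; charset=utf-8".
    applicable = {
        a.split(";", 1)[0].strip().lower() for a in assertions if a
    }
    # Exactly one distinct value means all applicable assertions agree.
    if len(applicable) == 1:
        return applicable.pop()
    return None


# Accepted: the three applicable assertions agree.
print(agreed_type("text/html; charset=utf-8", "text/html", "text/html", None))
# Rejected: the filename extension disagrees with the other assertions.
print(agreed_type("application/pdf", "text/html", "application/pdf", None))
```

A document is admitted only when the function returns a type; any disagreement, or the absence of any applicable assertion, rejects it.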

This post covers method #3 for detecting document type: using a pre-existing classification program.

I tried two Windows ports of the *NIX ‘file’ command.  One port ran, the other did not.  I have had success with the Optima SC port and have incorporated it into the document processing pipeline.

Attempt #1 (Failure)

Download file utility installer (“Complete package, except sources”) from http://gnuwin32.sourceforge.net/packages/file.htm.

Run the installer.  By default, the executables and necessary .dll files are installed to C:/program files/GnuWin32/bin/.  The ‘magic.mgc’ database file is installed to C:/program files/GnuWin32/share/file/magic.mgc.

Either add the install bin/ directory to your path, or copy the installed files to a directory that is already on your path.  Then add an environment variable MAGIC with the value c:/program files/GnuWin32/share/file (adjust both paths if you installed somewhere else).
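If you would rather not change system-wide settings, the MAGIC variable can be set for a single child process instead. The sketch below assumes the default install directory mentioned above; the helper name env_with_magic is hypothetical. It deliberately leaves the path alone: on Windows the program to launch is looked up using the parent's path, so the simplest approach is to pass the full path to file.exe when starting it.

```python
import os
from typing import Dict, Mapping, Optional


def env_with_magic(
    install_dir: str, base: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of an environment with MAGIC set for the file utility.

    install_dir is the GnuWin32 install directory (an assumption: adjust it
    to wherever the installer actually put the files).
    """
    env = dict(os.environ if base is None else base)
    # MAGIC takes the value given in the text above: the share/file location
    # under the install directory.
    env["MAGIC"] = install_dir + "/share/file"
    return env


env = env_with_magic("C:/program files/GnuWin32")
print(env["MAGIC"])
```

The returned dictionary can be passed as the env argument of subprocess.run, together with the full path to the executable.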

Run the file utility on a collection of files.  Result: the program emits thousands of lines of warnings.  Conclusion: this port does not work!

Attempt #2 (Success)

Download file utility from http://www.optimasc.com/products/fileid/file0.6.1-win32.zip.  Download type identification database from http://magicdb.org/magic.db.

Either add the install bin/ directory to your path, or copy the executable ‘file.exe’ and the identification database ‘magic.db’ to a directory in your search path.

Run the file utility.  Result: Default arguments work!

Further tests:
C:\> file --check-brief *
C:\> [GOOD OUTPUT!]

C:\> file --check-standard *
C:\> [GOOD OUTPUT!]

C:\> file --check-harder *
C:\> [CRASH!]

The program seems to work OK in the ‘brief’ and ‘standard’ modes.  It crashes in ‘harder’ mode, and it also crashes when asked to run on a directory containing 50,000 files.  The output is useful (though it must be parsed), and the program is reliable when run on a single file at a time.  It is possible to produce output in UTF-8.
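Since the program is reliable on one file at a time, a small driver can call it once per document and contain any crash to that document. The following Python sketch shows the idea under several assumptions: the name classify_each is hypothetical, the tool is assumed to exit with status 0 on success, and its output is assumed to be UTF-8. The raw output is returned unparsed, because the exact output format is not described here.

```python
import subprocess
import sys
from typing import Dict, Iterable, List, Optional


def classify_each(
    paths: Iterable[str], command: List[str]
) -> Dict[str, Optional[str]]:
    """Run the classifier once per file and map each path to its raw output.

    A crashed, failed, or hung invocation yields None for that file only,
    so one bad document cannot take down the whole batch.
    """
    results: Dict[str, Optional[str]] = {}
    for path in paths:
        try:
            done = subprocess.run(
                command + [path],
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                timeout=60,  # arbitrary limit chosen for this sketch
            )
        except (OSError, subprocess.TimeoutExpired):
            results[path] = None
            continue
        if done.returncode != 0:
            # Assumption: a non-zero exit status signals a crash or failure.
            results[path] = None
        else:
            # Assumption: UTF-8 output; undecodable bytes are replaced.
            results[path] = done.stdout.decode("utf-8", errors="replace").strip()
    return results


# Stand-in command for demonstration only; in practice this would be the
# path to file.exe plus whatever options you use.
stand_in = [sys.executable, "-c", "import sys; print(sys.argv[1] + ': text/html')"]
print(classify_each(["a.html"], stand_in))
```

Launching a new process per file costs some time, but it avoids the large-directory crash described above.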

In a test on approximately 100,000 documents downloaded from the internet, the initial type prediction (based on context and filename extension) was confirmed by both the HTTP Content-Type assertion and the output of the Optima SC ‘file’ classification program 92% of the time.
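For clarity, a figure like this is the fraction of documents for which all three assertions name the same type. A minimal Python sketch of that calculation follows; the function name verification_rate and the four sample records are invented for illustration and are not data from the test.

```python
from typing import Iterable, Tuple

# (initial prediction, HTTP Content-Type, classifier output)
Record = Tuple[str, str, str]


def verification_rate(records: Iterable[Record]) -> float:
    """Return the fraction of records whose three assertions all agree."""
    total = 0
    verified = 0
    for predicted, http_type, classified in records:
        total += 1
        if predicted == http_type == classified:
            verified += 1
    # An empty input has no meaningful rate; report 0.0 rather than divide by zero.
    return verified / total if total else 0.0


# Illustrative records only (not taken from the corpus).
sample = [
    ("text/html", "text/html", "text/html"),
    ("application/pdf", "application/pdf", "application/pdf"),
    ("image/jpeg", "image/jpeg", "image/jpeg"),
    ("application/pdf", "text/html", "text/html"),
]
print(verification_rate(sample))  # 3 of 4 records agree: 0.75
```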

In a test of 100 known PDF files, the Optima SC file program correctly classified all 100 as application/pdf.  In a test of 100 known well-formed HTML files, it correctly classified all 100 as text/html.

This Windows ‘file’ utility appears to be the work of Carl Eric Codere of Optima SC Inc. (http://www.optimasc.com/).  I appreciate Carl Eric’s efforts, and the decision by Optima SC to make the utility available.

 

One Response to “Simple Automated File Type Detection Under Windows”

  1. Thanks for the nice review of my file identifier!! I have released recently a new revision of the file identifier which corrects several bugs (and probably created some!!)… Enjoy the tool! Don’t hesitate to contact me if you see problems!
