The Linguistic Data Consortium makes a number of linguistic corpora available for research and development purposes.  The corpora are generally tagged by part of speech, which is nice, but for my purposes they are overpriced and undersized.

The largest corpus I can reasonably handle is about 1 terabyte, representing about 100 billion words in 1 billion sentences.  I am willing to allocate 100 days to acquiring the corpus.  How can I get a terabyte of text cheaply in 100 days?

Data Sources

  • Download the Gutenberg corpus.
  • Download WordNet.  WordNet is a dictionary database that will be fantastically useful for bootstrapping semantic analysis.
  • Use RSS feeds from several of the news organizations.
  • Scrape text from the web.  My impression is that PDF files will offer the highest-quality source of text, for the following reasons:
    1. PDF is the format of choice for longer and more formal documents;
    2. Short PDFs tend to be advertising brochures, but longer PDFs tend to be advertising-free;
    3. A number of programs already exist for extracting text from PDF files.  Thus, if I focus on PDF files alone, the parsing job is simplified.
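
As a rough illustration of why news feeds are attractive, here is a minimal sketch of pulling the title and body text out of an RSS 2.0 document using only the Python standard library.  The sample feed and the function name extract_items are made up for the example; a real feed would be fetched over HTTP, and many feeds carry only a summary rather than the full story.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, standing in for a document fetched from a news site.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <description>Body text of the first story.</description>
    </item>
    <item>
      <title>Second headline</title>
      <description>Body text of the second story.</description>
    </item>
  </channel>
</rss>"""


def extract_items(rss_xml):
    """Return a list of (title, description) pairs, one per <item> element."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        # findtext returns None when the child element is missing.
        title = (item.findtext("title") or "").strip()
        description = (item.findtext("description") or "").strip()
        items.append((title, description))
    return items


if __name__ == "__main__":
    for title, description in extract_items(SAMPLE_RSS):
        print(title, "|", description)
```

Because every item has the same small set of fields, the same few lines work across feeds, which is the "predictable format" advantage over scraping arbitrary pages.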

Data Volumes

The Gutenberg corpus contains roughly 16 gigabytes of text.  Probably 10 gigabytes of that is usable, non-duplicated English. 

WordNet is relatively small, but powerful.

It is difficult (a priori) to estimate the amount of text I can get from news sources, but it might be possible to get ~10 MB per day of text spanning a large number of different topic areas.  In 100 days, this could accumulate to 1 gigabyte.  If I added blog sources I could get more text, but the advantage in using news sources is that the text tends to follow a predictable format.  The parsing problem for blogs would be overwhelming.

Unique amongst all the search engines, Alexa Web Search is willing to sell you the complete results of a web search query.  With Alexa, I can request, and they will deliver, a list of all the PDF files on the web.  I registered as a user and requested a list of the URLs of all PDFs larger than 128 kilobytes.  I got a list of about half a million URLs.  This represents a text corpus of about 0.6 terabyte.  It cost me a few dollars.
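
To see how these estimates stack up against the 1 terabyte target, here is a small tally.  It is only a back-of-the-envelope sketch: the figures are the rough estimates given above, the variable names are mine, and I am assuming decimal units (1 terabyte = 1,000,000 megabytes).

```python
# All sizes in megabytes, using decimal units (an assumption).
MB_PER_GB = 1000
MB_PER_TB = 1000 * MB_PER_GB

estimates_mb = {
    "gutenberg_usable": 10 * MB_PER_GB,  # ~10 GB of usable English
    "news_100_days": 10 * 100,           # ~10 MB per day for 100 days
    "pdf_list": 600 * MB_PER_GB,         # ~0.6 TB from the PDF URL list
}

total_mb = sum(estimates_mb.values())
target_mb = 1 * MB_PER_TB

# Average size implied by 0.6 TB spread over about half a million URLs.
pdf_url_count = 500_000
avg_pdf_mb = estimates_mb["pdf_list"] / pdf_url_count

if __name__ == "__main__":
    for name, size in estimates_mb.items():
        print(f"{name:>18}: {size / MB_PER_GB:8.1f} GB")
    print(f"{'total':>18}: {total_mb / MB_PER_GB:8.1f} GB")
    print(f"fraction of target: {total_mb / target_mb:.1%}")
    print(f"average per PDF URL: {avg_pdf_mb:.1f} MB")
```

On these numbers the PDF list supplies almost all of the volume, and the total comes to roughly 0.6 terabyte, a little under two thirds of the target.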

Method of Acquisition

My internet connection downloads at about 100 megabytes per hour on average.  At that rate the full 16-gigabyte Gutenberg corpus takes about 160 hours, so I can download it in about a week.  WordNet downloads in less than an hour.

100 megabytes per hour is also more than sufficient to keep up with the rate of news production.

However, at 100 megabytes per hour it would take about 250 days, or more than 8 months, to download the 0.6 terabytes in my PDF database.  For this, I will need a bigger pipe.  I will need to rent space in a server farm.  At 2 or 3 megabits per second (a typical server farm rate), I can get 600 gigabytes in under a month.
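
The download-time arithmetic can be sketched as follows.  The function names are my own, and I am assuming decimal units (1 gigabyte = 1000 megabytes, 8 megabits = 1 megabyte) and a link that sustains its nominal rate around the clock, which a real connection will not quite do.

```python
def mbit_per_s_to_mb_per_hour(mbit_per_s):
    """Convert a line rate in megabits per second to megabytes per hour."""
    return mbit_per_s / 8 * 3600


def download_days(size_gb, rate_mb_per_hour):
    """Days needed to download size_gb gigabytes at a sustained rate."""
    hours = size_gb * 1000 / rate_mb_per_hour
    return hours / 24


if __name__ == "__main__":
    pdf_gb = 600  # the 0.6 terabyte PDF list

    # Home connection: about 100 megabytes per hour.
    print(f"home link: {download_days(pdf_gb, 100):.0f} days")

    # Server farm: 2 or 3 megabits per second.
    for mbit in (2, 3):
        rate = mbit_per_s_to_mb_per_hour(mbit)
        print(f"{mbit} Mbit/s: {download_days(pdf_gb, rate):.1f} days")
```

This gives 250 days on the home link, and about 28 days at 2 megabits per second or about 19 days at 3 megabits per second.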
