How Big is the Internet, Really?

Any discussion of internet data volumes is bound to cross into a starry-eyed, self-congratulatory contemplation of mind-numbing and exotic prefixes like peta-, exa-, and zetta-, so let me perform the obligatory obeisances now: the internet is big. It is really, really big. So big your mind cannot hold it; the stars cannot enfold it. If your measly brain were a betel nut, the internet would be Betelgeuse.

That said, let’s restrict our attention to text on the web and proceed to consider some numbers.

[In what follows I am only considering the amount of text on the web!  I am not considering images, video, audio, data, or anything else -- just text.]

Absolute Upper Bound on Rate of Increase

How fast could the textual internet possibly grow?

There are currently about 6*10^9 people on the earth.  Figure earth’s population might stabilize around 10^10 people.

In the future all of our words may be automatically transcribed and recorded to the internet by speech recognition technology.  Since we can speak faster than we can write, this gives us an upper bound on the amount of text any person might be able to generate.

If we figure an average of 10,000 spoken words per day per person and multiply by 10^10 people, we get about 10^14 words per day.  At 10 bytes per word, this is 10^15 bytes per day. (Pennebaker et al. measured approximately 15,000 words per day per college student.  Multiply this number by 2/3 to account for the more muted very young and very old.)

This works out to about 4*10^17 bytes of text per year =  400 petabytes per year added to the internet.
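
For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The variable names are mine, and the inputs are just the rough figures assumed above, so treat the result as an order-of-magnitude estimate and nothing more.

```python
# Rough inputs from the text; every one is an order-of-magnitude guess.
PEOPLE = 10**10           # assumed stabilized world population
WORDS_PER_DAY = 10_000    # spoken words per person per day
BYTES_PER_WORD = 10       # bytes per word, as figured in the text

bytes_per_day = PEOPLE * WORDS_PER_DAY * BYTES_PER_WORD
bytes_per_year = bytes_per_day * 365

PETABYTE = 10**15
print(f"{bytes_per_day:.1e} bytes per day")
print(f"{bytes_per_year / PETABYTE:.0f} petabytes per year")
```

The yearly figure comes out a little under 400 petabytes, which I round up in the text.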

Reasonable Upper Bound on Rate of Increase

How fast could the textual internet reasonably grow?

If every human had a blog (somebody’s Satanic Vision, I’m sure) and every human were assiduous about keeping it, we can figure 10 kB per day per person as an absolute maximum.  Multiply it out and we get a maximum increase of textual data per day on the web of 10^14 Bytes.  That is 100 terabytes added to the internet per day.

More realistically, we should imagine an output of one thousandth of that. [Currently, there are about 10^9 people with net access, about 1/100 as many blogs (10^7), and about 1/1000 as many active blogs (10^6).] 100 terabytes / 1000 = 0.1 terabyte per day. This corresponds to about 40 terabytes per year of text added to the internet.
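
The same estimate as a short Python sketch. The names are hypothetical, and the factor of 1000 is the rough discount used above, not a measured quantity.

```python
# Inputs from the text; all are rough guesses.
PEOPLE = 10**10                     # every human keeps a blog
BYTES_PER_PERSON_PER_DAY = 10_000   # 10 kB of text per person per day
REALISM_FACTOR = 1000               # discount from "everyone" to actual bloggers

max_per_day = PEOPLE * BYTES_PER_PERSON_PER_DAY
realistic_per_day = max_per_day // REALISM_FACTOR
realistic_per_year = realistic_per_day * 365

TERABYTE = 10**12
print(f"maximum:   {max_per_day / TERABYTE:.0f} TB per day")
print(f"realistic: {realistic_per_day / TERABYTE:.1f} TB per day")
print(f"realistic: {realistic_per_year / TERABYTE:.1f} TB per year")
```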

Reasonable Estimate of Current Rate of Increase

How fast is the textual internet currently growing?

Spinn3r.com says there are currently approximately 12 M blogs. If everybody wrote 10 kB per day (highly unrealistic!), this would work out to 120 GB per day of text. More realistically, we should figure about a hundredth of that for blogs, or roughly 1.2 GB. Add about as much again for web pages. A realistic estimate of daily growth in web text volume is more like 2.5 GB.

This works out to about 10^12 bytes, or 1 terabyte, of text per year.
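
Taking the 2.5 GB per day figure at face value, a small Python sketch turns it into a yearly number and compares it with the "reasonable upper bound" of 0.1 terabyte per day from the previous section. The variable names are mine.

```python
# Figures from the text; both are rough estimates.
CURRENT_BYTES_PER_DAY = 2_500_000_000   # 2.5 GB of new web text per day
UPPER_BOUND_BYTES_PER_DAY = 10**11      # 0.1 TB per day, the reasonable upper bound

current_per_year = CURRENT_BYTES_PER_DAY * 365
upper_bound_per_year = UPPER_BOUND_BYTES_PER_DAY * 365

# How many times over the text stream could grow before hitting the bound.
headroom = upper_bound_per_year / current_per_year

print(f"current growth: {current_per_year:.2e} bytes per year")
print(f"room to grow:   a factor of {headroom:.0f}")
```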

Reasonable Estimate of Accumulated Human Text

How much text is there in the world now?

Pandia.com estimates there are about 20 * 10^9 documents on the web as of early 2007.  If we figure an average of 1500 bytes per document we get 3*10^13 bytes of text already on the web.

There are about 10^6 distinct books published per year world wide.  At 10^5 bytes per book, this is 10^11 bytes per year.  Figure we’ve been publishing at near this rate for 100 years.  This means an accumulated store of 10^13 bytes.

Adding the internet and printed sources together gives about 4*10^13 bytes of text in the total human corpus.
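
Here is the tally as a Python sketch, using only the figures quoted above. The names are hypothetical, and the per-document and per-book sizes are my rough averages, not measurements.

```python
# Web text: documents on the web times average text per document.
WEB_DOCUMENTS = 20 * 10**9
BYTES_PER_DOCUMENT = 1500

# Printed text: books per year times text per book times years of publishing.
BOOKS_PER_YEAR = 10**6
BYTES_PER_BOOK = 10**5
YEARS_OF_PUBLISHING = 100

web_bytes = WEB_DOCUMENTS * BYTES_PER_DOCUMENT
book_bytes = BOOKS_PER_YEAR * BYTES_PER_BOOK * YEARS_OF_PUBLISHING
total_bytes = web_bytes + book_bytes

TERABYTE = 10**12
print(f"web:   {web_bytes / TERABYTE:.0f} TB")
print(f"books: {book_bytes / TERABYTE:.0f} TB")
print(f"total: {total_bytes / TERABYTE:.0f} TB")
```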

Summary

The accumulated human output of text is somewhere around 40 terabytes.  The total is growing at the rate of about 1 terabyte per year.  The rate of growth might realistically increase to as much as 40 terabytes per year as more people get web access and the technology becomes more familiar to people.

This is a lot of text, but figure this: you could go down to your local computer store and buy a 1 TB hard drive for about $100. At about 40 drives, or roughly $4,000, storing everything ever published by a human being anywhere is within the budget of many of us. In addition, we could store humanity’s yearly output of text for about $100 per year.

Furthermore, there is a limit on the amount of original textual information humanity might produce in a given year.  That limit is approximately 400 petabytes per year. 

Post Script: Real-Time Stream of All Human Text

1 terabyte per year is about 2.5 gigabytes per day. This works out to an uncompressed data rate of less than 250 kbit per second. The compressed data rate would be less than 100 kbit per second. By way of comparison, the OECD defines a broadband connection as one offering at least 256 kbit per second. So any person with a broadband connection could stream all human electronically published text from the entire globe in real time.
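
The data-rate claim is easy to check with a few lines of Python. The 3:1 compression ratio below is my own assumption for plain text, not a measured figure, and the names are hypothetical.

```python
# Daily text volume from the estimate above.
BYTES_PER_DAY = 2_500_000_000
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: plain text compresses roughly 3:1 (not a measured figure).
ASSUMED_COMPRESSION_RATIO = 3

# Broadband threshold quoted in the text.
BROADBAND_BITS_PER_SECOND = 256_000

raw_bits_per_second = BYTES_PER_DAY * 8 / SECONDS_PER_DAY
compressed_bits_per_second = raw_bits_per_second / ASSUMED_COMPRESSION_RATIO
fits_in_broadband = raw_bits_per_second < BROADBAND_BITS_PER_SECOND

print(f"uncompressed: {raw_bits_per_second / 1000:.0f} kbit/s")
print(f"compressed:   {compressed_bits_per_second / 1000:.0f} kbit/s")
print(f"fits in a 256 kbit/s connection: {fits_in_broadband}")
```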

So here’s an idea for a web service. Register the domain www.humancorpus.com. (It’s available; I checked.) Spider the web and extract all text. Throw away pictures, embedded audio, everything but text. Compress it and stream it in real time. Google could do it as a public service. Cool, huh?
