For learning purposes, I want to make a simple web indexer which crawls the web and saves all found pages in a MySQL database with their titles and URLs, with this table (the page's content is not saved):
- id: integer AUTO_INCREMENT PRI
- title: varchar(100)
- url: varchar(500)
How big would that database be, approximately? Are we talking hundreds of MB, a few GB, or something in the TB range? Thanks.
-
Hi Koning,
For the quick and dirty answer, scroll to the bottom. Otherwise, read through my narrative to understand how I came up with those numbers.
In 2008, Google released some numbers that might be of interest to you. At that time, Google's spiders were aware of over 1 trillion (that's 1,000,000,000,000) unique URLs. One thing to take note of is that not all of these URLs are indexed. For your case here, we'll pretend that we are going to index everything. You can read this announcement here: http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
The current size of your id column only allows for 2 billion URLs in the index. If you make that an unsigned int you can squeeze 4 billion out of it, but assuming a near-infinite scale you'd want to use an unsigned bigint. In all reality, you'd want to use a UUID or something similar so you can generate IDs concurrently (and from multiple hosts), but for this exercise we will assume that we are using an unsigned bigint. So, in theory, we've got this infinitely scalable MySQL table, defined as follows (a CREATE TABLE sketch follows the list):
- id: unsigned bigint AUTO_INCREMENT
- title: varchar(100)
- url: varchar(500)
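For concreteness, here is a minimal sketch of that revised definition as a MySQL statement. Only the three columns come from the discussion above; the table name "pages", the NOT NULL constraint, and the storage engine are placeholders I've chosen for the example.

```sql
-- Sketch of the revised schema; "pages" and ENGINE=InnoDB are placeholder choices
CREATE TABLE pages (
    id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    title VARCHAR(100),
    url   VARCHAR(500),
    PRIMARY KEY (id)
) ENGINE=InnoDB;
```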
The storage requirements for each of these columns are:
- id: 8 bytes
- title: 100 + 1 = 101 bytes
- url: 500 + 2 = 502 bytes
- Row size: 502 + 101 + 8 = 611 bytes (neglecting overhead, table headers, indexes, etc.)
Reference: http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html
Now, to get the theoretical table size we simply multiply by our 1 trillion unique URLs:
611 bytes * 1,000,000,000,000 URLs = 611,000,000,000,000 bytes =~ 555.7 terabytes
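If you want MySQL to do the arithmetic for you, a one-liner like this reproduces the estimate; it treats a terabyte as 1024^4 bytes, the same convention used in the figure above:

```sql
-- Back-of-envelope estimate: 611 bytes per row times 1 trillion rows
SELECT 611 * 1000000000000 AS total_bytes,
       611 * 1000000000000 / POW(1024, 4) AS approx_terabytes;
```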
So there you have it. One trillion URLs times the per-row storage of the table we defined comes to almost 556 terabytes of data. We would also have to add space for indexes, table overhead, and a few other things. Likewise, we could subtract some, because for our exercise I assumed each varchar column was maxed out. I hope this helps.
(Also, just a quick clarification: I know that bigint columns aren't near-infinite, but the math is easier when you're not worrying about logistics.)
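Once you actually start loading data, you don't have to rely on the back-of-envelope figure at all: MySQL exposes per-table data and index sizes through information_schema. A query along these lines (the database and table names are placeholders matching the sketch above) will show how close reality is to the estimate:

```sql
-- Report the on-disk size of the table, including its indexes
SELECT table_name,
       ROUND((data_length + index_length) / POW(1024, 3), 2) AS size_gb
FROM information_schema.TABLES
WHERE table_schema = 'webindex'   -- placeholder database name
  AND table_name   = 'pages';     -- placeholder table name
```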
From Charles Hooper