Friday, January 13, 2006

Data Warehouses (2006) Growth Keeps Going and Going

Databases, and more specifically data warehouses, are growing at an ever faster rate. "Data, Data, Everywhere" provides some metrics on the largest known data warehouses. The size of these complexes is roughly doubling every 12-18 months. We have seen this rate, and even faster growth, where I work. Wal-Mart is approaching 600 terabytes today (January 2006) and projects to be above the petabyte mark later this year.
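To put the doubling rate in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes smooth exponential growth, decimal units (1 petabyte = 1000 terabytes), and the 600 terabyte starting point quoted above; the function names are my own and purely illustrative.

```python
import math

def months_to_reach(start_tb, target_tb, doubling_months):
    """Months until start_tb grows to target_tb, assuming the size
    doubles every doubling_months (smooth exponential growth)."""
    return doubling_months * math.log2(target_tb / start_tb)

def projected_size_tb(start_tb, months, doubling_months):
    """Projected size in terabytes after the given number of months."""
    return start_tb * 2 ** (months / doubling_months)

START_TB = 600      # Wal-Mart figure quoted in the post (January 2006)
PETABYTE_TB = 1000  # assumption: decimal units, 1 PB = 1000 TB

for doubling in (12, 18):
    months = months_to_reach(START_TB, PETABYTE_TB, doubling)
    print(f"doubling every {doubling} months: about {months:.1f} months to 1 PB")
```

Under these assumptions the petabyte mark is roughly 9 months away at the fast end of the range and roughly 13 months away at the slow end, so a "later this year" projection lines up with doubling about every 12 months.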

I remember in the late 1990s when a terabyte was a milestone. Well, today the largest data warehouse complexes are approaching the petabyte mark. eBay and Yahoo each have over 100 terabytes today. Google is not mentioned, and as usual they are relatively quiet about their metrics. I would expect that Google is in the Yahoo range or larger. The same goes for Amazon. Yahoo is mentioned as the largest commercial data warehouse based on the Winter Corp. survey conducted in mid-2005.

One architectural characteristic common to all these large data warehouses is massively parallel clustering. The Wal-Mart complex has a massively parallel 1000-processor system, though there are no details about how many server machines are used. Rumor has it that Google runs a massively parallel system of around 100k servers built on open source technology. The power consumption alone of a system like this is mind-boggling.

The growth rates mentioned are staggering. Wal-Mart adds over a billion rows of new data a day. eBay adds approximately 750k rows a day. "Database Lessons, Petabyte Style" mentions a Stanford University research database (Stanford Linear Accelerator Center) that was adding 500GB, yes that is gigabytes, of data a day in 2004. That meant that every 29 days they were accumulating data equal to all the books in the Library of Congress! Whoa!
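The arithmetic behind those rates is easy to check. The sketch below is only an illustration: it assumes decimal units (1 terabyte = 1000 gigabytes) and a perfectly even load across a 24-hour day, which no real system has, and the helper names are made up for this post.

```python
GB_PER_TB = 1000          # assumption: decimal units
SECONDS_PER_DAY = 86_400  # 24 * 60 * 60

def accumulated_tb(gb_per_day, days):
    """Total terabytes accumulated at a steady daily rate."""
    return gb_per_day * days / GB_PER_TB

def rows_per_second(rows_per_day):
    """Average insert rate, assuming load is spread evenly over the day."""
    return rows_per_day / SECONDS_PER_DAY

# Stanford Linear Accelerator Center: 500GB a day
print(accumulated_tb(500, 29))    # 14.5 terabytes in 29 days
print(accumulated_tb(500, 365))   # 182.5 terabytes in a year

# Wal-Mart: one billion rows a day (the post says "over a billion")
print(round(rows_per_second(1_000_000_000)))  # 11574 rows per second
```

So the 29-day figure works out to about 14.5 terabytes, and a billion rows a day averages out to more than 11,000 rows every second, around the clock.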

What do all these data warehouse complexes have in common? They all require massive clusters of servers and all have issues with managing their storage capacity. As Inmon put it, 'volume, volume, volume'. That is and always will be the #1 problem with data warehousing.
