Too Much Data
Bill de hÓra writes that Phat Data is the challenge of the future. Couldn't agree more. My recent work with data warehouses has certainly shown me that managing and accessing terabytes of data is nontrivial.
We've learned a few things, the most important being "Denormalize and aggregate." Avoiding I/O is the single biggest win. And we've achieved some pretty decent performance numbers with a traditional relational database. However, as Bill points out, we're using it as a big indexed file system.
But having SQL and the numerous tools that support SQL has been critical to our success. I can't imagine solving these problems with proprietary tools. Sure, it's possible. Google did it, but they have more PhDs than you can shake a stick at. Plus some mega clusters.
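To make "denormalize and aggregate" a little more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, columns, and rows are all made up for illustration and are nothing like our actual warehouse schema. The region name lives directly on each sales row (denormalized, so no join), and a small summary table is built once so that later queries read a handful of pre-aggregated rows instead of scanning the detail table.

```python
import sqlite3


def build_demo():
    """Create an in-memory database with a detail table and a summary table."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Hypothetical detail table: one row per sale, region stored inline
    # (denormalized) rather than looked up through a join.
    cur.execute("CREATE TABLE sales (sale_day TEXT, region TEXT, amount INTEGER)")
    rows = [
        ("2007-01-01", "east", 100),
        ("2007-01-01", "east", 250),
        ("2007-01-01", "west", 75),
        ("2007-01-02", "east", 40),
        ("2007-01-02", "west", 60),
        ("2007-01-02", "west", 10),
    ]
    cur.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

    # Aggregate once, up front: one row per (day, region).
    cur.execute(
        """
        CREATE TABLE daily_sales AS
        SELECT sale_day, region, COUNT(*) AS n_sales, SUM(amount) AS total
        FROM sales
        GROUP BY sale_day, region
        """
    )
    conn.commit()
    return conn


def region_total_from_detail(conn, region):
    """Answer the question by scanning every detail row for the region."""
    sql = "SELECT SUM(amount) FROM sales WHERE region = ?"
    return conn.execute(sql, (region,)).fetchone()[0]


def region_total_from_summary(conn, region):
    """Answer the same question from the much smaller summary table."""
    sql = "SELECT SUM(total) FROM daily_sales WHERE region = ?"
    return conn.execute(sql, (region,)).fetchone()[0]
```

Both functions return the same totals. The difference is how many rows have to be read to get there, and at terabyte scale that gap is where the I/O savings come from.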
While multi-core CPUs are a welcome upgrade, what I really want is multi-spindle hard drives. Call me when I can emulate a Google cluster on my desktop. What's lacking is a cheap and effective way to parallelize my disk activity.
What would be really nice is to turn my corporate network into a giant compute farm that uses all those CPUs and all those hard drives. Now that is really turning the network into the computer. So, don't give me EC2, give me EC2 in my office. With everyone using their huge desktops just to read email and write PowerPoint, I know there's a ton of unused computing power. This is P2P with a purpose.