
Reasons Why CouchDB Is Exciting

I've been playing with CouchDB lately, and having a good time getting out of the relational database mindset. I'm not against relational databases, but I do think the time has come to stop thinking they are the only way to store and process data. There are a few reasons why I think CouchDB is exciting. Arbitrary JSON document storage. I like this feature not so much because I'm not required to design a rigid schema, but because it means I can store arbitrarily complex and deeply nested data structures. This clears the way for easily storing an object graph without the pain of decomposing it into a relational schema. Incremental view rebuilds. This is a killer feature for me, because it means a small change in the data does not invalidate the entire view. Keys can be complex objects. In other words, you can now use an array or hash as a key in a view. For example, a key can be [2008, 01, 12], which represents a date. N...
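A minimal sketch of those three points, assuming a local CouchDB instance at http://localhost:5984 and the Python requests library (the database, document, and view names are made up for illustration):

```python
import requests

BASE = "http://localhost:5984"   # assumed local CouchDB instance
DB = f"{BASE}/scratch"           # hypothetical database name

# Create the database (a 412 response just means it already exists).
requests.put(DB)

# Store an arbitrarily nested JSON document -- no schema to design up front.
doc = {
    "type": "post",
    "title": "Reasons Why CouchDB Is Exciting",
    "published": {"year": 2008, "month": 1, "day": 12},
    "tags": ["couchdb", "json", "views"],
    "comments": [{"author": "anonymous", "text": "Nice write-up"}],
}
requests.put(f"{DB}/example-post", json=doc)

# A view whose key is a complex object: an array acting as a date.
design = {
    "views": {
        "by_date": {
            "map": "function(doc) {"
                   "  if (doc.type == 'post') {"
                   "    emit([doc.published.year, doc.published.month, doc.published.day], 1);"
                   "  }"
                   "}"
        }
    }
}
requests.put(f"{DB}/_design/posts", json=design)

# Range query against the array key: everything published in January 2008.
resp = requests.get(
    f"{DB}/_design/posts/_view/by_date",
    params={"startkey": "[2008, 1, 1]", "endkey": "[2008, 1, 31]"},
)
print(resp.json())
```

Because view indexes are maintained incrementally, adding one more document only runs the map function over that document instead of recomputing the whole view.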

Too Much Data

Bill de hÓra writes that Phat Data is the challenge of the future. Couldn't agree more. My recent work with data warehouses has certainly shown me that managing and accessing terabytes of data is non-trivial. We've learned a few things, most importantly: "Denormalize and aggregate." Avoiding I/O is the most important step to take, and we've achieved some pretty decent performance numbers with a traditional relational database. However, as Bill points out, we're using it as a big indexed file system. But having SQL, and the numerous tools that support SQL, has been critical to our success. I can't imagine solving these problems with proprietary tools. Sure, it's possible. Google did it, but they have more PhDs than you can shake a stick at, plus some mega clusters. While multi-core CPUs are a welcome upgrade, what I really want is multi-spindle hard drives. Call me when I can emulate a Google cluster in my desktop. What's lacking is ...
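As a rough illustration of "denormalize and aggregate" (using SQLite and made-up table names purely for the sketch): dimension attributes are copied onto the fact rows and rolled up once, so reporting queries scan a small aggregate table instead of the raw facts.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Raw, fine-grained fact rows (imagine hundreds of millions of these).
    CREATE TABLE sales_facts (
        sale_date   TEXT,
        store_name  TEXT,   -- dimension attribute denormalized onto the fact
        product     TEXT,
        amount      REAL
    );

    INSERT INTO sales_facts VALUES
        ('2007-03-01', 'Downtown', 'widget', 10.0),
        ('2007-03-01', 'Downtown', 'gadget', 25.0),
        ('2007-03-02', 'Uptown',   'widget', 12.5);

    -- Pre-aggregated rollup: one row per (date, store) instead of per sale.
    CREATE TABLE sales_daily_agg AS
        SELECT sale_date, store_name, SUM(amount) AS total_amount
        FROM sales_facts
        GROUP BY sale_date, store_name;
""")

# Reports hit the small aggregate table, avoiding I/O over the raw facts.
for row in con.execute("SELECT * FROM sales_daily_agg ORDER BY sale_date"):
    print(row)
```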

Yes, database normalization is good

So InfoQ has collected a few blog posts which ask: data normalization, is it really that good? Of course it's good, as long as you have requirements which dictate this optimization. If your application requires extremely fast writes, as can happen in a heavily loaded OLTP system, then data normalization is your savior. If your application requires extremely fast reads, as in OLAP systems, then of course data normalization is a killer. These competing requirements are exactly why you have database systems optimized for either reads or writes. This is why large systems will maintain an operational system conforming to OLTP principles and reporting systems conforming to OLAP principles. Remember, traditional database systems are row oriented. This architecture is itself an optimization for OLTP and normalized data. Read-mostly (or read-only) systems can be column oriented, organizing the data on disk to optimize reads. For instance, Google's BigTable is an implem...
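A toy sketch of the row-oriented versus column-oriented distinction, with made-up data: a row store keeps each record's fields together, which suits transactional reads and writes of single records, while a column store keeps each column contiguous, so an aggregate over one column only touches that column.

```python
# Row-oriented layout: each record's fields are stored together.
# Writing or updating one order touches a single contiguous record.
row_store = [
    {"order_id": 1, "customer": "alice", "amount": 25.0},
    {"order_id": 2, "customer": "bob",   "amount": 40.0},
    {"order_id": 3, "customer": "alice", "amount": 15.0},
]

# Column-oriented layout: each column is stored contiguously.
# Summing `amount` scans one array and never touches the other columns.
column_store = {
    "order_id": [1, 2, 3],
    "customer": ["alice", "bob", "alice"],
    "amount":   [25.0, 40.0, 15.0],
}

# OLTP-style access: fetch one whole record.
print(row_store[1])

# OLAP-style access: aggregate a single column.
print(sum(column_store["amount"]))
```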

Calculating and Compressing OLAP Cubes

Over the weekend, I've been hard at work writing an implementation of the Dwarf algorithm for compressing and calculating (aggregating) OLAP cubes. The initial results look good, and I've learned a great deal. To begin, a little background. OLAP cubes typically perform numerous calculations and aggregations on a dimensional model in order to speed up query performance. Data warehouses are all about storing extremely large sets of data and presenting it to the user for analysis. Users being users, they don't want to wait while you perform a SQL GROUP BY and SUM over 10 million rows. So an effective OLAP cube calculates all of the GROUP BY combinations ahead of time, dramatically speeding up users' queries against the cube. The trouble is that the number of GROUP BY combinations increases exponentially as the number of dimensions (and dimension attributes) increases. Plus, data warehouses are meant to store data over multiple years, and at a very fine gra...
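A small sketch of why the combinations blow up (dimension names are hypothetical): the group-bys a full cube precomputes form the power set of its dimensions, so d dimensions already mean 2^d group-bys before dimension attributes and hierarchies are counted.

```python
from itertools import combinations

def groupby_combinations(dimensions):
    """Enumerate every GROUP BY a full cube would precompute,
    including the empty grouping (the grand total)."""
    for r in range(len(dimensions) + 1):
        for combo in combinations(dimensions, r):
            yield combo

dims = ["date", "store", "product", "customer", "promotion"]  # hypothetical
cube = list(groupby_combinations(dims))

print(len(cube))   # 2 ** 5 == 32 group-bys for only 5 dimensions
print(cube[:4])    # (), ('date',), ('store',), ('product',)
```

Dwarf's trick, as I understand it, is to exploit the prefix and suffix redundancy across these group-bys, so the precomputed cube costs nowhere near 2^d full copies of the data.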

Dabble DB Brings the Web of Data to Life

Dabble DB has completely blown me away. Dabble DB is like Club Med for your data. You want your data to get a massage while sipping a Mai Tai on the beach? You got it. Your data will get the five-star treatment at Dabble DB. So everyone is talking about Web 3.0, AKA the Web of Data, AKA the Semantic Web. Those visions are all well and good, and I do believe we'll see a Data Centric Web soon. But if there's a Web of Data, that must mean you've got Data on the Web. What? Your data is buried in some SQL Server database on the company LAN? That doesn't sound very webby to me. And you're building all these custom, one-off Visual Basic apps or Excel macros to manage your data? Tsk, tsk. So not webby. This is where Dabble DB comes in. Not only does it provide a very slick, dripping-with-AJAX interface for you to import and manage your data, it's a *very* smart interface. Normal muggles (Haven't read Harry Potter? Whaaaa?) can easily use Dabble DB ...

links for 2007-03-30

Implementing Data Cubes Efficiently. How to choose which views to materialize in an OLAP cube when it is too expensive to materialize all views. This is the next optimization for our aggregation strategies in ActiveWarehouse. (tags: olap database)
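For the curious, a rough sketch of a greedy heuristic in that vein, under simplifying assumptions (a view's cost is just its row count, the numbers below are made up, and a group-by can be answered from any view whose columns contain it): repeatedly materialize whichever view most reduces the total cost of answering all group-bys.

```python
# Views are named by their GROUP BY columns; sizes are made-up row counts.
# A query on view w can be answered from view v when w's columns are a
# subset of v's columns.
sizes = {
    frozenset({"date", "store", "product"}): 6_000_000,  # top of the lattice
    frozenset({"date", "store"}):              800_000,
    frozenset({"date", "product"}):            950_000,
    frozenset({"store", "product"}):           120_000,
    frozenset({"date"}):                         1_000,
    frozenset({"store"}):                          300,
    frozenset({"product"}):                     20_000,
    frozenset():                                     1,
}

TOP = frozenset({"date", "store", "product"})

def answer_cost(view, materialized):
    """Cheapest materialized ancestor that can answer queries on `view`."""
    return min(sizes[m] for m in materialized if view <= m)

def greedy_select(k):
    """Pick k extra views (beyond the top view) with the largest benefit."""
    materialized = {TOP}
    for _ in range(k):
        def benefit(v):
            return sum(
                max(0, answer_cost(w, materialized) - sizes[v])
                for w in sizes if w <= v
            )
        best = max((v for v in sizes if v not in materialized), key=benefit)
        materialized.add(best)
    return materialized

print(greedy_select(2))
```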

ActiveWarehouse Gets Some Love

ActiveWarehouse, the Ruby on Rails plugin for data warehouse development, was written up by InfoQ in their article ActiveWarehouse, a New Step for Enterprise Ruby. I've been writing different aggregation strategies for ActiveWarehouse, trying to find something that's not too slow or cumbersome. ActiveWarehouse supports pluggable aggregation, or rollup, strategies, so you can use what works best for you. We have some very large data sets and very large dimensions (one of our dimensions has 215 million rows), so if ActiveWarehouse can eventually handle that, I think we're in good shape. I can say that ActiveWarehouse will work great if you have a smallish data set; up to a million rows or so in your dimensions should be fine. Of course, no matter how much work we put into optimizing ActiveWarehouse's aggregation schemes, smart database tuning will always help tremendously.

links for 2007-03-28

Pentaho Analysis Services: Aggregate Tables. How Mondrian builds and utilizes aggregate tables to help query performance of large cubes. (tags: olap database)

On the Computation of Multidimensional Aggregates. This paper presents fast algorithms for computing a collection of group-bys. (tags: olap database)

RDF Queries and Ontologies

Danny nicely puts the problem I'm trying to solve in his post titled SPARQL trick #23. He says: "Running inference on the store's data as a whole may be expensive, but without it many results for a given query may be missed." This is exactly why we are attracted to semantic web technologies. I have a lot of data, but I know there are many more pieces of information in there if I can apply some ontologies and rules. My queries against the system must search both the raw triples I have plus any triples that can be inferred by my ontologies. To me, this is one of the main value adds of the system. The other main value add of an RDF store vs. a traditional relational store is that it's much easier and cheaper to say arbitrary things. In a relational store, your schema must be defined up front, severely limiting your ability to define data in the future. With RDF, saying anything about anything is cheap. There are some solutions that work well for data sets that are static. ...
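A small sketch of the raw-plus-inferred idea using rdflib, with a hand-rolled RDFS subclass rule and a made-up vocabulary; a real store would use a proper reasoner, but the point is that the query matches triples that were never explicitly asserted.

```python
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")  # hypothetical vocabulary

g = Graph()
# Raw triples: two ontology statements plus one instance assertion.
g.add((EX.Sparrow, RDFS.subClassOf, EX.Bird))
g.add((EX.Bird, RDFS.subClassOf, EX.Animal))
g.add((EX.jack, RDF.type, EX.Sparrow))

# Naive forward chaining for two RDFS rules:
#   (C subClassOf D) & (D subClassOf E) => (C subClassOf E)
#   (x type C) & (C subClassOf D)       => (x type D)
changed = True
while changed:
    changed = False
    for c, _, d in list(g.triples((None, RDFS.subClassOf, None))):
        for _, _, e in list(g.triples((d, RDFS.subClassOf, None))):
            if (c, RDFS.subClassOf, e) not in g:
                g.add((c, RDFS.subClassOf, e))
                changed = True
        for x, _, _ in list(g.triples((None, RDF.type, c))):
            if (x, RDF.type, d) not in g:
                g.add((x, RDF.type, d))
                changed = True

# The query asks for animals; (jack type Animal) was never asserted directly,
# only inferred from the subclass chain.
q = """
    SELECT ?thing WHERE { ?thing a <http://example.org/Animal> . }
"""
for row in g.query(q):
    print(row.thing)
```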