Sunday, April 29, 2007
Friday, April 27, 2007
links for 2007-04-28
We are creating infrastructure to support ad-hoc analysis of very large data sets.
(tags: database)
Sunday, April 22, 2007
Saturday, April 21, 2007
Thursday, April 19, 2007
Monday, April 16, 2007
Calculating and Compressing OLAP Cubes
Over the weekend, I've been hard at work at writing an implementation of the Dwarf algorithm for compressing and calculating (aggregating) OLAP cubes. The initial results look good, and I've learned a great deal.
To begin, a little background. OLAP cubes typically perform numerous calculations and aggregations on a dimensional model in order to speed up query performance. Data warehouses are all about storing extremely large sets of data, and presenting it to the user for analysis. Users being users, they don't want to wait while you perform a SQL GROUP BY and SUM over 10 million rows. So an effective OLAP cube calculates all of the GROUP BY combinations ahead of time, dramatically speeding up users' queries against the cube.
The trouble is that the number of GROUP BY combinations increases exponentially as the number of dimensions (and dimension attributes) increases. Plus, data warehouses are meant to store data over multiple years, and at a very fine grain. Therefore, a typical data warehouse can easily store hundreds of millions of rows.
So you can see where this gets a bit tricky. How do you effectively calculate all of your GROUP BY combinations across such large data sets? How can you do it before the universe ends? How can you update your calculated cube with new rows?
This is where the Dwarf algorithm comes in. It promises to provide a way of calculating and compressing your cube with only one pass through your data. Sounds good to me! Let's try it.
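To make the exponential growth concrete, here is a minimal sketch that lists every possible grouping for a handful of dimensions. The dimension names are made up for illustration; the point is only that n dimensions produce 2^n groupings, counting the grand total.

```ruby
# Hypothetical dimension names, chosen only for illustration.
dimensions = [:year, :region, :product]

# Every subset of the dimensions is one possible GROUP BY clause,
# including the empty subset (the grand total).
groupings = (0..dimensions.size).flat_map do |k|
  dimensions.combination(k).to_a
end

groupings.each do |dims|
  puts dims.empty? ? "(grand total)" : "GROUP BY #{dims.join(', ')}"
end

puts "#{dimensions.size} dimensions -> #{groupings.size} groupings"
# last line printed: 3 dimensions -> 8 groupings
```

Going from 3 dimensions to 5 takes this from 8 groupings to 32, and that is before dimension attributes are counted.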
First off, my implementation, which I've named BigDwarf, will become part of ActiveWarehouse. ActiveWarehouse is a Rails plugin which brings proven data warehouse techniques and conventions to Ruby on Rails. ActiveWarehouse follows the Rails way of "convention over configuration". BigDwarf is one of ActiveWarehouse's different aggregation strategies.
I've named the implementation BigDwarf because I've implemented the Dwarf algorithm in spirit only. Dwarf works so well because it does both prefix and suffix coalescing, or compression. I've implemented only prefix compression so far. Suffix coalescing, which apparently provides the most dramatic space reductions for a sparse cube, is on the TODO list.
The current implementation does not require ordering of the data set, which traditional Dwarf does. This is good and bad. It's good because we don't need to sort everything before loading, which can be a costly step. However, we'd probably just use the UNIX sort command to sort the file, with the assumption that it's faster than doing it in the database (that's a big assumption). Loading the data in arbitrary order gives us a lot of flexibility.
However, it's bad because there's a great potential speed optimization we can use if we order all the dimension attributes in the file. It makes the algorithm a bit more complex, but I think it also reduces the amount of recursion. This is on the TODO list for further research and implementation.
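The general idea of an unordered, prefix-sharing cube can be sketched in a few lines. This is not BigDwarf's code, just a toy under stated assumptions: a nested hash with one level per dimension, a hypothetical ALL key meaning "any value of this dimension", and a single summed fact at the leaves. The class and method names are invented for this example, and no suffix coalescing is attempted.

```ruby
# Marker for "any value of this dimension" (a name invented for this sketch).
ALL = :all

# A toy cube: one nested-hash level per dimension, summed facts at the leaves.
# Rows that share leading dimension values share the same branches (prefixes).
class TinyCube
  def initialize(depth)
    @depth = depth # number of dimensions in every row
    @root = {}
  end

  # Add one row. Rows may arrive in any order.
  def add(dims, fact)
    insert(@root, dims, 0, fact)
  end

  # Look up an aggregate. Pass one key per dimension; use ALL to leave
  # a dimension unfiltered. Returns 0 for combinations never seen.
  def sum(*dims)
    node = @root
    dims.each do |value|
      node = node[value]
      return 0 if node.nil?
    end
    node
  end

  private

  # At each level, descend into both the row's own value and the ALL branch,
  # so every GROUP BY combination is updated in the same pass.
  def insert(node, dims, level, fact)
    [dims[level], ALL].each do |key|
      if level == @depth - 1
        node[key] = (node[key] || 0) + fact
      else
        insert(node[key] ||= {}, dims, level + 1, fact)
      end
    end
  end
end

cube = TinyCube.new(3)
cube.add(["2007", "east", "widget"], 10)
cube.add(["2007", "west", "widget"], 5)
cube.add(["2006", "east", "gadget"], 7)

puts cube.sum("2007", ALL, ALL) # 15
puts cube.sum(ALL, "east", ALL) # 17
puts cube.sum(ALL, ALL, ALL)    # 22
```

Each row touches 2^d leaves here (d being the number of dimensions), which is where the recursion cost comes from. The ALL subtrees also repeat a lot of structure, which is the kind of redundancy suffix coalescing is meant to remove.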
Because BigDwarf is for ActiveWarehouse, this is all written in Ruby. Turns out Ruby is slow. Repeat after me: Ruby is slow. It's a beautiful language, but it's just not a data crunching language. I've managed to optimize BigDwarf enough so that the bottleneck now is the + operator on Fixnum. Here are some things to avoid if you want to write fast Ruby code:
* ==
* [] - array access
* Hash#[] - calculating hashes
Pretty much everything involving accessing your data in a collection will slow you down.
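If you want to check these on your own machine, a rough timing harness along the following lines may help. It is only a sketch: the loop count, the sample data, and the `time` helper are assumptions made for this example, and the results will vary a lot by Ruby version and hardware (in Ruby 2.4 and later, Fixnum has been folded into Integer).

```ruby
N = 200_000
array = Array.new(100) { |i| i }
hash  = { "east" => 1, "west" => 2 }

# Hypothetical helper: wall-clock seconds taken by the given block.
time = lambda do |&work|
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  work.call
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Each entry runs one operation N times in a tight loop.
# No timings are quoted here because they depend on the environment.
timings = {
  "Integer#+" => time.call { total = 0; N.times { total += 1 } },
  "=="        => time.call { N.times { |i| i == 42 } },
  "Array#[]"  => time.call { N.times { |i| array[i % 100] } },
  "Hash#[]"   => time.call { N.times { hash["east"] } }
}

timings.each { |name, secs| puts format("%-10s %.4fs", name, secs) }
```

Each loop also pays for the block call itself, so the numbers are better read relative to one another than as absolute costs.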
Once I had optimized BigDwarf enough that I could continue working on it, I ran some tests. Here's what we have so far. My test data set is a 10,037,355 line file extracted from our SQL Server 2005 database. The file includes 5 dimensional attributes and one fact (the number we want to sum across all of the dimensions). The file is a text file, tab delimited, one line per row.
BigDwarf processes this file at 4348 lines per second. It stores the fully calculated cube in 3,301,132 bytes, down from the original file's 337,022,624 bytes. That's a very dramatic reduction of approximately 99%. YMMV, of course, as dimension cardinality and the size of the values in your data dictate much of that compression. The lower the cardinality, the higher the compression you'll see.
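The figures above can be checked with a little arithmetic. The sketch below uses only the numbers quoted in this post; the load time assumes the throughput stays constant across the whole file, which is an assumption rather than something that was measured separately.

```ruby
lines          = 10_037_355
lines_per_sec  = 4348
original_bytes = 337_022_624
cube_bytes     = 3_301_132

# Total load time, assuming constant throughput for the entire file.
load_minutes = lines / lines_per_sec.to_f / 60

# Space saved as a percentage, and the size ratio of file to cube.
reduction = (1 - cube_bytes.to_f / original_bytes) * 100
ratio     = original_bytes.to_f / cube_bytes

puts "load time: about #{load_minutes.round(1)} minutes" # about 38.5 minutes
puts "reduction: #{reduction.round(1)}%"                 # 99.0%
puts "ratio:     #{ratio.round}x smaller"                # 102x smaller
```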
BigDwarf also supports querying, with basic filtering support. I've yet to do work to optimize the query performance, or to really get a sense of how fast it is. That's on my TODO as well.
All in all, BigDwarf is working really well so far. There's work to do for further compression through suffix coalescing and further optimization through smarter cube building.
Labels:
activewarehouse,
database,
olap
Saturday, April 14, 2007
links for 2007-04-15
Tips and tricks to get rid of the nasty smell in my washing machine.
(tags: house)
Friday, April 13, 2007
Thursday, April 12, 2007
Another Ruby ETL Project from Google’s Summer of Code
Google's Summer of Code is sponsoring a Framework for ETL and Data mining operations in Ruby.
Hmm... sounds a lot like ActiveWarehouse ETL, which is a Ruby library for ETL. ActiveWarehouse-ETL is already in progress and well on its way.
Here's the whole list of Ruby projects sponsored by Summer of Code.
Wednesday, April 11, 2007
Dabble DB Brings the Web of Data to Life
Dabble DB has completely blown me away. Dabble DB is like Club Med for your data. You want your data to get a massage while sipping a Mai Tai on the beach, you got it. Your data will get the five star treatment at Dabble DB.
So everyone is talking about Web 3.0, AKA the Web of Data, AKA the Semantic Web. Those visions are all well and good, and I do believe we'll see a Data Centric Web soon. But if there's a Web of Data, that must mean you've got Data on the Web.
What? Your data is buried in some SQL Server database on the company LAN? That doesn't sound very webby to me. And you're building all these custom, one-off, Visual Basic apps or Excel macros to manage your data? Tsk, tsk. So not webby.
This is where Dabble DB comes in. Not only does it provide a very slick, dripping with AJAX interface for you to import and manage your data, it's a *very* smart interface. Normal muggles (Haven't read Harry Potter? Whaaaa?) can easily use Dabble DB to classify, link, sort, and visualize their data. Dabble DB is not a snazzy front end to a relational database system. Dabble DB is a snazzy front end to data.
Let's put it this way: I haven't seen a desktop application that helps you with your data like Dabble DB.
OK, enough of the uber love fest. Bringing it all back to the semantic web, Dabble DB might be in a class of killer applications for the semantic web. I really love Dave Beckett's description of the semantic web: "The semantic web is webby data." So the semantic web will need, as a killer app, something that makes managing and *linking* data super easy and, more importantly, incredibly rewarding.
That last statement is important. The killer application for the semantic web must be *rewarding*. That is, you will get out of it more than you put into it. Dabble DB does this to some extent, as you can graph your data, map your data, export your data, subscribe to your data.
It doesn't appear that Dabble exports to RDF, nor does it appear that you can link data together via ontologies. But if Dabble DB doesn't do that, someone else will. For data that is truly webby is data that can be extended by sources outside of your control.
At work, we've been building a large data warehouse, and the interface to go with it, so systems like Dabble DB are extremely interesting to me. I want to give my users an experience like Dabble.
Labels:
database,
semantic web,
web2.0
links for 2007-04-12
CSS framework, gets me out of the business of micromanaging my web page layouts.
(tags: css)
Cheat sheet PDF for Yahoo Grids CSS layout framework.
(tags: css)
Javascript framework and library that contains widgets.
(tags: css javascript)
Mims is hot like Mims. Does it get hotter than that?
(tags: music)
Tuesday, April 10, 2007
Excellent REST Tutorial Series
Softies on Rails is posting an excellent REST tutorial series. They are doing a great job in describing how a web application is just a collection of resources (and not necessarily web pages), and how you can think of your web application as the gateway to manipulate those resources. If you want to learn REST, and good web architecture, this tutorial series is the way to start.
This reminds me of the great "A resource is a web page about your car, not your car itself" debate.
Norm Walsh sums this debate up nicely:
Suffice it to say that some folks think http://example.org/some/path must identify “a document” (what the AWWW calls an “information resource”) and some folks think it can identify anything at all.
In any case, thinking of your web application as a collection of resources is the way to go. You don't usually need to get all carried away with "what is a resource?" unless you are trying to model knowledge, but that's another story...
Labels:
rubyonrails,
web
Monday, April 9, 2007
Ruby on Rails Mangles Decimal Type to Integer
So when is a DECIMAL not a DECIMAL? Go ahead, take your time. I'll wait.
A decimal type is converted to an integer by Ruby on Rails through its infinite wisdom. As seen in active_record\connection_adapters\abstract\schema_definitions.rb, line 180 (in Rails 1.2.3):
when /decimal|numeric|number/i
  extract_scale(field_type) == 0 ? :integer : :decimal
So, apparently, Rails thinks, "Well, I know you specified a decimal type, but hey, you didn't specify a scale, so you must have made a typo! Here, let me just change that to integer for you."
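The effect of that one-line check can be reproduced outside Rails. The sketch below is a simplified stand-in, not the Rails source: `scale_of` and `guess_type` are names invented for this example, and the regular expressions are an assumption about how a scale might be read from a SQL type string.

```ruby
# Hypothetical stand-in for the scale check; NOT the Rails implementation.
# Returns the declared scale, 0 when only a precision is given,
# and nil when the type has no parenthesised arguments.
def scale_of(sql_type)
  case sql_type
  when /\((\d+)\s*,\s*(\d+)\)/ then $2.to_i
  when /\(\d+\)/               then 0
  end
end

# Mirrors the ternary quoted above: a scale of 0 turns a decimal into an integer.
def guess_type(sql_type)
  case sql_type
  when /decimal|numeric|number/i
    scale_of(sql_type) == 0 ? :integer : :decimal
  else
    :other
  end
end

p guess_type("decimal(10,2)") # :decimal
p guess_type("decimal(10)")   # :integer
p guess_type("decimal(10,0)") # :integer
```

So a column declared as a decimal with a precision but no scale comes back as an integer, which is exactly the surprise described here.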
Or, in other words:
You: I'd like a hamburger with nothing on it.
Rails: OK, here's your hot dog.
You: WTF??!?
Apparently, this issue has been reported before. However, as of Version 1.2.3 of Rails, this issue still exists.
I've just opened a new issue on the Rails Trac site. We'll see whether any of the core developers agree that converting a decimal to an integer is completely counterintuitive.
Labels:
rubyonrails
Friday, April 6, 2007
links for 2007-04-07
Specific instructions for encoding H.264 video using QuickTime Pro. Useful for encoding for YouTube.
Thursday, April 5, 2007
links for 2007-04-06
Nice introduction to behavior driven development (BDD).
(tags: programming)
Compression suggestions and techniques for YouTube.
Monday, April 2, 2007
links for 2007-04-03
A blank shell for up to date, best practices, Ruby on Rails applications.
(tags: rubyonrails)