
Extending Hadoop Pig for Hierarchical Data

I've been playing with Hadoop Pig lately, and having a fun time. Pig is an easy-to-use language for writing MapReduce jobs against Hadoop. Our data is very hierarchical, and we calculate a lot of aggregates for a node itself, its children, and the node plus its children. We have a few tricks up our sleeves for handling these kinds of aggregates in SQL, but of course with MapReduce an entirely new way of thinking is required. Luckily, Pig makes it easy to create User Defined Functions (UDFs) that extend the Pig language. I was able to take an existing Pig UDF, TOKENIZE, and alter it to suit my needs. Specifically, our data looks like this:

111,/A/B/C
222,/A/B
333,/A/B/C

We need to answer questions such as "How many records for A and all of its children?" In this case, the answer is three. We also need to answer "How many records for just A?" which is zero, or "for just C?" which is two. Our strategy is to take the path (e.g. /A/B/C)...
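Here's a minimal sketch of what such a UDF might look like, modeled on TOKENIZE: it expands a path into a bag of all its prefixes so that, after a FLATTEN and a GROUP, the count for /A naturally includes A and all of its children. The class name ExpandPath and the details are illustrative, not the actual code:

```java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical UDF, modeled on TOKENIZE: turns "/A/B/C" into the bag
// { (/A), (/A/B), (/A/B/C) }, one tuple per ancestor-or-self path.
public class ExpandPath extends EvalFunc<DataBag> {
    private static final TupleFactory TUPLES = TupleFactory.getInstance();
    private static final BagFactory BAGS = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String path = (String) input.get(0);   // e.g. "/A/B/C"
        DataBag output = BAGS.newDefaultBag();
        StringBuilder prefix = new StringBuilder();
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue;                      // skip the empty leading segment
            }
            prefix.append('/').append(segment);
            output.add(TUPLES.newTuple(prefix.toString()));
        }
        return output;
    }
}
```

The "self plus children" counts then fall out of a flatten and group in Pig Latin (again a sketch):

```
records  = LOAD 'data' USING PigStorage(',') AS (id:chararray, path:chararray);
expanded = FOREACH records GENERATE id, FLATTEN(ExpandPath(path)) AS node;
counts   = FOREACH (GROUP expanded BY node) GENERATE group, COUNT(expanded);
```

The "just A" counts come from grouping on the original path column directly.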

Reasons Why CouchDB Is Exciting

I've been playing with CouchDB lately, and having a good time getting out of the relational database mindset. I'm not against relational databases, but I do think the time has come to stop thinking they are the only way to store and process data. There are a few reasons why I think CouchDB is exciting.

Arbitrary JSON document storage. I like this feature not so much because I'm not required to design a rigid schema, but because it implies that I can store arbitrarily complex and deeply nested data structures. This clears the way for easily storing an object graph without the pain of decomposing it into a relational schema.

Incremental view rebuilds. This is a killer feature for me, because it implies that a small change in the data does not invalidate the entire view.

Keys can be complex objects. In other words, you can now use an array or hash as a key in a view. For example, a key can be [2008, 01, 12], which represents a date. N...
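As a concrete illustration of complex keys, here's a minimal sketch of a CouchDB map function; the document fields (type, year, month, day) are made up for the example:

```javascript
// A map function emitting a complex (array) key. The fields type,
// year, month, and day are hypothetical document fields.
function (doc) {
  if (doc.type === "sale") {
    // Array keys collate element by element, so the view can be sliced
    // by year, then by year and month, then by the full date.
    emit([doc.year, doc.month, doc.day], 1);
  }
}
```

Paired with the built-in _count reduce, querying with group_level=1 rolls the totals up by year, and group_level=2 by year and month.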

When will Google let us run our own Map/Reduce programs?

Reading about Sam Ruby's issues with JSON for Map/Reduce got me thinking: how long before Google lets us run our own Map/Reduce programs on their clusters? We all know one of the best ways to scale is to push the operation to the data. And who has all the data? Why, Google, of course. They are in a perfect position to host and run applications which map/reduce the web for individuals or organizations. Imagine that I want to start collecting results by microformats such as hCard. It would be nice to formulate a query which understands microformats as a map/reduce and send it off to Google. Facebook got this right by creating a Platform. Developers can write applications which link directly into Facebook. I want to do the same with Google.
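To make that concrete, here's a purely hypothetical sketch of such a job, written Hadoop-style since no such Google service exists; every name below is invented for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Purely hypothetical: what a user-submitted map/reduce over a web
// crawl might look like, counting pages that publish an hCard.
public class HCardCount {

    public static class HCardMapper
            extends Mapper<Text, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(Text url, Text html, Context context)
                throws IOException, InterruptedException {
            // Crude hCard detection: the microformat's root class is "vcard".
            if (html.toString().contains("class=\"vcard\"")) {
                context.write(new Text("hcard"), ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts,
                Context context) throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
                total += c.get();
            }
            // One line out: how many crawled pages carry an hCard.
            context.write(key, new IntWritable(total));
        }
    }
}
```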