Posts

Showing posts from April, 2009

Ye Olde Scala Presentation to Honolulu Coders

I presented the wonderfully named "Scala: Java, Erlang, and Ruby’s Hot Three Way Love Child" Scala presentation to the Honolulu Coders back in 2007. I went looking for the actual presentation online, and I'm not sure I ever posted it.

I had a brief fling with Scala while I was looking for multi-core friendly environments to build data processing frameworks. I came away very impressed and look forward to deploying Scala in a future project. After Erlang and Scala, I started programming in a functional style back in my Ruby code. I consider the experiments a win for that fact alone.

This post is to prove that I knew Scala before it was cool. :P

Extending Hadoop Pig for Hierarchical Data

I've been playing with Hadoop Pig lately and having a fun time. Pig is an easy-to-use language for writing Map Reduce jobs against Hadoop.

Our data is very hierarchical, and we calculate a lot of aggregates for self nodes, their child nodes, and self plus children. We have a few tricks up our sleeves in SQL for handling these types of aggregates, but of course Map Reduce requires an entirely new way of thinking.

Luckily, Pig makes it easy to create User Defined Functions (UDFs) that extend the Pig language. I was able to take an existing Pig UDF, TOKENIZE, and alter it to suit my needs.

Specifically, our data looks like this:

111,/A/B/C
222,/A/B
333,/A/B/C

We need to answer questions such as "How many records for A and all of its children?" In this case, the answer is three. We also need to answer "How many records for just A?" which is zero, or "for just C?" which is two.
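To make the two kinds of counts concrete, here is a minimal Python sketch over the three sample records. It is not the Pig UDF from this post: the helper name ancestors_and_self and the approach of expanding each path into its prefixes are assumptions chosen for illustration only.

```python
from collections import Counter

# The three sample records from the post: (id, path).
records = [
    ("111", "/A/B/C"),
    ("222", "/A/B"),
    ("333", "/A/B/C"),
]


def ancestors_and_self(path):
    """Hypothetical helper: '/A/B/C' -> ['/A', '/A/B', '/A/B/C']."""
    parts = [p for p in path.split("/") if p]
    return ["/" + "/".join(parts[: i + 1]) for i in range(len(parts))]


# Records attached directly to a node ("just this node").
self_counts = Counter(path for _, path in records)

# Records attached to a node or to anything beneath it.
subtree_counts = Counter(
    prefix for _, path in records for prefix in ancestors_and_self(path)
)

print(subtree_counts["/A"])    # A and all of its children -> 3
print(self_counts["/A"])       # just A -> 0
print(self_counts["/A/B/C"])   # just C -> 2
```

Emitting one row per prefix turns the "self plus children" question into an ordinary group-and-count, which is the shape of problem Map Reduce handles well.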

Our strategy is to take the path (eg /A/B/C) and split it i…

Mac OS X, Hadoop 0.19.1, and Java 1.6

If you're excited, like I am, about Amazon's recent announcement that they are now offering Elastic Map Reduce, you probably want to try a quick Hadoop MapReduce application to test the waters. I found out quickly that if you are on a Mac (as I am), you'll need to make a few configuration changes before things work correctly. Below is what I needed to do to get Hadoop running on my Mac with Java 1.6.

This post assumes you are running the latest Mac OS X 10.5 with all updates applied.

Enable Java 1.6 Support

To enable Java 1.6, open up the Java Preferences application. This can be found in /Applications/Utilities/Java Preferences.




You will need to drag Java 1.6 up and place it at the top of both the applet and application version lists.

Open up a terminal and type java -version and you should see something like the following:
java version "1.6.0_07"
Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed m…