Sunday, April 19, 2009

Ye Olde Scala Presentation to Honolulu Coders

I presented the wonderfully named "Scala: Java, Erlang, and Ruby’s Hot Three Way Love Child" Scala presentation to the Honolulu Coders back in 2007. I went looking for the actual presentation online, and I'm not sure I ever posted it.

I had a brief fling with Scala while I was looking for multi-core friendly environments for building data processing frameworks. I came away very impressed and look forward to deploying Scala in a future project. After Erlang and Scala, I started programming in a functional style back in my Ruby code. I consider the experiments a win for that fact alone.

This post is to prove that I knew Scala before it was cool. :P

Sunday, April 12, 2009

Extending Hadoop Pig for Hierarchical Data

I've been playing with Hadoop Pig lately, and having a fun time. Pig is an easy-to-use language for writing Map Reduce jobs against Hadoop.

Our data is very hierarchical, and we calculate a lot of aggregates for self nodes, their child nodes, and self plus children. We have a few tricks up our sleeves in SQL for handling these types of aggregates, but of course Map Reduce requires an entirely new way of thinking.

Luckily, Pig makes it easy to create User Defined Functions (UDFs) that extend the Pig language. I was able to take an existing Pig UDF, TOKENIZE, and alter it to suit my needs.

Specifically, our data looks like this:

111,/A/B/C
222,/A/B
333,/A/B/C


We need to answer questions such as "How many records for A and all of its children?" In this case, the answer is three. We also need to answer "How many records for just A?" which is zero, or "for just C?" which is two.

Our strategy is to take the path (e.g. /A/B/C) and split it into all the different paths contained within it. For example, /A/B/C can be split into:

  • /A
  • /A/B
  • /A/B/C
Now we can take the original record (e.g. 111) and attribute it to all three paths that stem from the original /A/B/C. Once we have all the paths, we can perform aggregate calculations the way Map Reduce wants to: group by path, then count or sum. To see how this works, get the Pig source code and look for the TOKENIZE class. Below are my modifications:

@Override
public DataBag exec(Tuple input) throws IOException {
    try {
        DataBag output = mBagFactory.newDefaultBag();
        // First argument: the path to split, e.g. "/A/B/C".
        Object o = input.get(0);
        if (!(o instanceof String)) {
            int errCode = 2114;
            String msg = "Expected input to be chararray, but" +
                " got " + o.getClass().getName();
            throw new ExecException(msg, errCode, PigException.BUG);
        }
        // Second argument: the path delimiter, e.g. "/".
        String delimiter = (String) input.get(1);
        StringTokenizer tok = new StringTokenizer((String) o, delimiter, false);
        // Rebuild the path one segment at a time and emit every prefix.
        StringBuilder sb = new StringBuilder();
        while (tok.hasMoreTokens()) {
            sb.append(delimiter);
            sb.append(tok.nextToken());
            output.add(mTupleFactory.newTuple(sb.toString()));
        }
        return output;
    } catch (ExecException ee) {
        throw ee;
    }
}


The end result now looks like this:

111,/A
111,/A/B
111,/A/B/C
222,/A
222,/A/B
333,/A
333,/A/B
333,/A/B/C
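
If you want to try the same idea without a Pig installation, here is a minimal plain-Java sketch of the prefix expansion followed by a group-and-count. The class PathPrefixes and its methods expand and countSelfPlusChildren are hypothetical names made up for this illustration; they are not part of Pig, and the in-memory map only stands in for the grouping that Map Reduce would do. Note that StringTokenizer treats every character of the delimiter string as a separator, so this assumes a single-character delimiter such as "/".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-alone class for illustration; not part of Pig.
class PathPrefixes {

    // Expands a path such as "/A/B/C" into every prefix it contains:
    // "/A", "/A/B", "/A/B/C".
    static List<String> expand(String path, String delimiter) {
        List<String> prefixes = new ArrayList<String>();
        StringTokenizer tok = new StringTokenizer(path, delimiter, false);
        StringBuilder sb = new StringBuilder();
        while (tok.hasMoreTokens()) {
            sb.append(delimiter).append(tok.nextToken());
            prefixes.add(sb.toString());
        }
        return prefixes;
    }

    // Counts, for every prefix, how many records sit at or below it.
    // This plays the role of the group-by-and-count step.
    static Map<String, Integer> countSelfPlusChildren(List<String> paths, String delimiter) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String path : paths) {
            for (String prefix : expand(path, delimiter)) {
                Integer current = counts.get(prefix);
                counts.put(prefix, current == null ? 1 : current + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The paths of records 111, 222 and 333 from the sample data.
        List<String> paths = Arrays.asList("/A/B/C", "/A/B", "/A/B/C");
        // Prints {/A=3, /A/B=3, /A/B/C=2}
        System.out.println(countSelfPlusChildren(paths, "/"));
    }
}
```

The count of three for /A matches the "A and all of its children" answer from earlier. The "just A" and "just C" questions are answered by counting the original, unexpanded paths instead.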


To learn more about writing your own Pig UDFs, consult the Pig UDF Manual.

Thursday, April 2, 2009

Mac OS X, Hadoop 0.19.1, and Java 1.6

If you're excited, like I am, about Amazon's recent announcement of Elastic MapReduce, you probably want to try a quick Hadoop MapReduce application to test the waters. I found out quickly that if you are on a Mac (as I am), you'll need to make a few configuration changes before things work correctly. Below is what I needed to do to get Hadoop running on my Mac with Java 1.6.

This post assumes you are running the latest Mac OS X 10.5 with all updates applied.

Enable Java 1.6 Support


To enable Java 1.6, open up the Java Preferences application. This can be found in /Applications/Utilities/Java Preferences.




You will need to drag Java 1.6 up and place it at the top of both the applet and application version lists.

Open up a terminal and type java -version and you should see something like the following:

java version "1.6.0_07"
Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)

Configure JAVA_HOME


This was the tough part. Once you have Java 1.6 as the default runtime, you need to configure your JAVA_HOME environment variable to reflect this as well. Hadoop needs JAVA_HOME set before it will run correctly; otherwise you'll see:

Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file

This assumes you are using the standard bash shell. Open up ~/.bash_profile (or create it if it doesn't exist) and add this line:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home

Save the file. Return to the terminal and type . ~/.bash_profile to load the new JAVA_HOME into your current shell. To verify, you can type: echo $JAVA_HOME.
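
If you want to double check from inside the JVM as well, a tiny Java program can report which runtime is actually in use. This is only a sketch; the class name JavaEnvCheck and its describe method are made up for this post. It relies on the standard java.version and java.class.version system properties and on System.getenv. A Java 1.6 runtime should report a class version of 50.0, while a 1.5 runtime reports 49.0.

```java
// Hypothetical helper class for checking the JVM setup; not part of Hadoop.
class JavaEnvCheck {

    // Builds a short report of the running JVM and the JAVA_HOME variable.
    static String describe() {
        String javaHome = System.getenv("JAVA_HOME");
        return "java.version=" + System.getProperty("java.version")
            + "\njava.class.version=" + System.getProperty("java.class.version")
            + "\nJAVA_HOME=" + (javaHome == null ? "(not set)" : javaHome);
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Save it as JavaEnvCheck.java, compile it with javac JavaEnvCheck.java, and run it with java JavaEnvCheck from the same terminal you will use for Hadoop.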

While you are in .bash_profile you might as well add a HADOOP_HOME variable, which will make running hadoop from the command line much easier. Below is my full .bash_profile:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
export HADOOP_HOME=~sethladd/Development/hadoop-0.19.1

export PATH=$HADOOP_HOME/bin:$PATH

Note that I also placed $HADOOP_HOME/bin into my path, so now I can type hadoop from anywhere.

Now that you are all set up, try the simple WordCount Hadoop Tutorial.

Disclaimer

I'm probably required to say that the views expressed in this blog are my own, and do not necessarily reflect those of my employer. Also, except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 3.0 License, and code samples are licensed under the BSD License.