Tuesday, September 25, 2007

When will Google let us run our own Map/Reduce programs?

Reading about Sam Ruby's issues with JSON for Map/Reduce got me thinking. How long before Google will let us run our own Map/Reduce programs on their clusters?

We all know that one of the best ways to scale is to push the operation to the data. And who has all the data? Why, that's Google, of course. They are in a perfect position to host and run applications which map/reduce the web for individuals or organizations.

Imagine that I want to start collecting results by microformats such as hCard. It would be nice to formulate a query which understands microformats as a map/reduce and send it off to Google.
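
As a rough illustration of what such a job might look like, here is a minimal sketch in Erlang. Everything in it is an assumption made for illustration: the module name hcard_mr, the {Url, Html} page tuples, and the shortcut of counting the hCard root class name "vcard" as a plain substring instead of parsing the HTML. It says nothing about how Google's own Map/Reduce works.

```erlang
-module(hcard_mr).
-export([count_hcards/1]).

%% Map step: turn one {Url, Html} page into a {Url, Count} pair.
%% (Hypothetical input shape; a real job would parse the markup.)
map_page({Url, Html}) ->
    {Url, count_occurrences(Html, "vcard", 0)}.

%% Count non-overlapping occurrences of Pattern in Text.
count_occurrences(Text, Pattern, Acc) ->
    case string:str(Text, Pattern) of
        0 ->
            Acc;
        Pos ->
            Rest = lists:nthtail(Pos - 1 + length(Pattern), Text),
            count_occurrences(Rest, Pattern, Acc + 1)
    end.

%% Reduce step: sum the per-page counts into one total.
reduce_counts(Pairs) ->
    lists:foldl(fun({_Url, N}, Sum) -> Sum + N end, 0, Pairs).

%% Run map, then reduce, over a list of pages (sequentially, for clarity).
count_hcards(Pages) ->
    reduce_counts(lists:map(fun map_page/1, Pages)).
```

The map step is independent per page, which is the part a cluster could spread across machines that already hold the pages; only the small {Url, Count} pairs would need to travel to the reduce step.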

Facebook got this right by creating a Platform. Developers can write applications which link directly into Facebook. I want to do the same with Google.

Monday, September 24, 2007

Semantic Web Use Case #32354343

I should be able to tell Flickr to let any of my Facebook friends view my photos.

I Second That Emotion

So Tim Bray finds out that Erlang IO is slow. I can attest to this fact, as my recent work on reading large files in Erlang has shown that IO and string manipulation are much slower than I would have wanted.

Yes, like Bray, my file reading is single threaded (although, what I do with the line is very multi-threaded) so I suppose using a single thread for Erlang isn't very Erlang-like in the first place.

In the meantime, I'm porting my OLAP cube generator to Scala. The assumption (and shortly, I hope, the proof) is that the JVM can do file IO much better than Erlang, yet I can still take advantage of Scala's Actors to retain my concurrency.

Update: OK, some numbers and code. This is a benchmark for Erlang and Scala to read in a file line by line.

First, the Erlang code:


%% Open the file and read it line by line, discarding each line.
process_file2(Filename) ->
    {ok, File} = file:open(Filename, read),
    process_lines2(File).

process_lines2(File) ->
    case io:get_line(File, '') of
        eof -> file:close(File);
        _ -> process_lines2(File)
    end.


Now the Scala code:


import java.io.{BufferedReader, FileReader}

object LineReader {

  // Apply f to each remaining line of the reader.
  def foreachline(in: BufferedReader, f: String => Unit): Unit = {
    val line = in.readLine()
    if (line != null) {
      f(line)
      foreachline(in, f)
    }
  }

  // Open the file, run f on every line, then close the reader.
  def forLines(filename: String, f: String => Unit) = {
    val in = new BufferedReader(new FileReader(filename))
    foreachline(in, f)
    in.close()
  }

}


OK, so these aren't exactly the same. The Scala example dispatches to a function for every line, so if anything Scala is at a disadvantage.

The timings, three runs each, were taken on my MacBook Pro with a 2.2 GHz Intel Core 2 Duo. Erlang is the BEAM emulator 5.5.5 and Scala is 2.6 running on JDK 1.5 on Mac OS X. The Erlang code was compiled with HiPE.

I am reading in a 1028071833-byte file with 10037355 lines.

Code     Run 1        Run 2        Run 3
Erlang   205.830 sec  208.999 sec  207.454 sec
Java     36.094 sec   39.917 sec   34.337 sec

Sunday, September 23, 2007

QOTD

7 reasons I switched back to PHP after 2 years on Rails


PROGRAMMING LANGUAGES ARE LIKE GIRLFRIENDS: THE NEW ONE IS BETTER BECAUSE *YOU* ARE BETTER

Thursday, September 20, 2007

Now Using OpenDNS

I've just configured my home router to use OpenDNS. It's a pretty cool service. OpenDNS is a suite of value added DNS services, like phishing detection and fine grained control over what resolves.

I made the change because, ever since I moved to DSL, my DNS lookups have been very slow. It's not the kind of thing you notice until, well, you notice that it's too slow. Switching to OpenDNS fixed that problem. Now I don't notice the DNS lookups and everything continues to function.

No SMP Erlang for Windows?

I was trying to enable SMP on my Windows Erlang install, but I was shocked to find out that SMP is not supported in the official Windows Erlang distribution.

I confirmed this by reading the Errata for Programming Erlang:


#8893: "SMP Erlang has been enabled ... since R11B-0 on all platforms where SMP Erlang is known to work."
PLEASE mention that Windows is NOT among them.


What the heck? That seems like a horrible little secret of the Erlang world. I don't want to compile my own Erlang on Windows just to get SMP (which, imo, is the killer feature of Erlang).

Semantic Web Doesn’t Have to Be Difficult

After reading Semantic Web: Difficulties with the Classic Approach, I am even more certain that we're putting too many expectations on the semantic web. The semantic web doesn't have to be difficult to build or use. It simply starts with resetting expectations and re-branding.

To start, the semantic web needs to be re-branded as the Data Web. Now take a deep breath. Doesn't the air feel lighter and taste sweeter? That's because the heaviness of the baggage brought along by the word "semantic" is gone. People see semantic and go all screwy: "Replace humans with computers?" or "How do you deal with uncertainty?" or "How do we agree on what we mean by agreement?" or "A.I. never worked."

Even Tim B.L. thinks that the name "semantic web" isn't very good:


I don't think it's a very good name but we're stuck with it now. The word semantics is used by different groups to mean different things. But now people understand that the Semantic Web is the Data Web. I think we could have called it the Data Web. It would have been simpler.


What does it mean to have a data web? To me, it means that the underlying data that powers the web page/site/application is exposed to the web via URIs. The data web is about pulling up all those databases that live under a web application and placing them squarely on the web. Placing something on the web simply means giving it a URI and, often times, making sure a representation is returned when you dereference the URI.

We already have databases, we already have web servers, we already have HTTP, we already have URIs. The pieces are in place. We just aren't in the habit of publishing machine readable data, as often times the data is seen as the heavily protected intellectual property. This is a mind set issue that will be changed over time as people and businesses figure out how to make money off of data (hey Google, figure out AdSense for RDF or connect all the data together, expose it to end users, and place ads on it (or wait, they already do that)).

Repeat after me: Data Web, Data Web, Data Web. Put my data on the web. Give it a URI. Create a Web of Data.

Semantics: old 'n busted. Data: the new hotness.

(note to self, put money where mouth is)

Wednesday, September 19, 2007

Blog, Meet SliceHost

My little Shuttle PC, acting as the server for this blog, is acting strange and just hanging up. A reboot fixes the problem, but a day or two later and *poof* it's gone off the network.

Instead of tinkering around with the hardware to find out what's wrong, I just moved to SliceHost. They offer Linux VPS servers at very reasonable prices. I couldn't find another host that offered the same specs for less.

A bonus of moving to SliceHost is that I'll get a lot more bandwidth.

QOTD

Joel points out:


... Ajax apps can be inconsistent, and have a lot of trouble working together — you can’t really cut and paste objects from one Ajax app to another, for example, so I’m not sure how you get a picture from Gmail to Flickr. Come on guys, Cut and Paste was invented 25 years ago.

Monday, September 17, 2007

Beautiful Programs

If you've been reading Beautiful Code, and I hope you have, you should also read Donald Knuth's Computer Programming as an Art. It actually makes you feel better about writing all that code and always trying to get it right.


The possibility of writing beautiful programs, even in assembly language, is what got me hooked on programming in the first place.

Wednesday, September 12, 2007

Tail Recursive Line Processing in Erlang

When I was testing performance in Erlang using funs, I mentioned I wanted to see what would happen if I made the functions tail recursive. I just took a crack at it, and it looks like making this particular function tail recursive didn't help performance. (as always, please let me know if my assumptions are incorrect.)

Tail Recursive Line Processing




process_file(Filename) ->
    {ok, File} = file:open(Filename, read),
    process_lines(File, first, 0).

%% Tokenize one line and convert the measures to integers.
process_line(eof) -> done;
process_line(Line) ->
    L = string:tokens(string:strip(Line, both, $\n), " "),
    {_Dimensions, Measures} = lists:split(10, L),
    lists:map(fun(X) -> {I,_} = string:to_integer(X), I end, Measures).

%% Stop, and close the file, once the previous read returned eof.
process_lines(File, eof, _) -> file:close(File);
process_lines(File, _PreviousLine, LineNum) ->
    Line = io:get_line(File, ''),
    process_line(Line),
    process_lines(File, Line, LineNum + 1).


Running this code against a file of 10037355 lines (1028071833 bytes) takes 401.40 seconds on average. That is only slightly better than the previous attempts, which are not tail recursive (and, imo, easier to read).

Erlang Fun Results

As I code more and more in Erlang, I'm very interested in how to achieve the best performance possible. I'm very new to the language, so I'm still unsure what the best practices are concerning performance. Tonight I was testing the effects of using a fun() in an algorithm.

My test concerns reading a tab delimited text file, tokenizing it, and converting any numbers into integers. I've split the program into two conceptual parts: 1) the file IO and line reading, and 2) the handling of the line. I wanted to test the performance differences between using a fun() for the line handling vs just including the line handling code directly.

My test text file is 10037355 lines long and 1028071833 bytes big. I compiled my code using HiPE.

The quick answer is, using a fun() is slightly slower than not using it (which is to be expected). For my particular test, using a fun() was approximately 20% slower.

test           Run 1 (sec)  Run 2 (sec)
fun()          484.631      485.380
without fun()  404.017      403.632


Here's the code. I stole the Erlang timing functions from David King (thanks David!).

Using a fun()




%% Run Mod:Fun(Args), print the elapsed time in seconds, return the result.
time_takes(Mod,Fun,Args) ->
    Start=erlang:now(),
    Result = apply(Mod,Fun,Args),
    Stop=erlang:now(),
    io:format("~p~n",[time_diff(Start,Stop)]),
    Result.

%% Difference between two {MegaSecs, Secs, MicroSecs} timestamps, in seconds.
time_diff({A1,A2,A3}, {B1,B2,B3}) ->
    (B1 - A1) * 1000000 + (B2 - A2) + (B3 - A3) / 1000000.0 .

handle_line(Line, SplitOn) ->
    L = string:tokens(string:strip(Line, both, $\n), " "),
    {_Dimensions, Measures} = lists:split(SplitOn, L),
    lists:map(fun(X) -> {I,_} = string:to_integer(X), I end, Measures).

%% Proc is a one-argument fun, presumably fun(L) -> handle_line(L, 10) end.
process_file(Filename, Proc) ->
    {ok, File} = file:open(Filename, read),
    process_lines(File, Proc, 0).


process_lines(File, Proc, LineNum) ->
    case io:get_line(File, '') of
        eof -> file:close(File);
        Line ->
            Proc(Line),
            process_lines(File, Proc, LineNum + 1)
    end.


Including Code Directly (no fun())



(I'm just showing the difference.)


process_lines(File, Proc, LineNum) ->
    case io:get_line(File, '') of
        eof -> file:close(File);
        Line ->
            L = string:tokens(string:strip(Line, both, $\n), " "),
            {_Dimensions, Measures} = lists:split(10, L),
            lists:map(fun(X) -> {I,_} = string:to_integer(X), I end, Measures),
            process_lines(File, Proc, LineNum + 1)
    end.


Of course, it's easy to argue that Erlang isn't the best language for string manipulation. But this part of the application is hardly the bottleneck, so I'm willing to take the bloat in order to take advantage of the concurrency later on.

Next up, I'll do timing experiments testing if tail recursion speeds anything up.

Tuesday, September 11, 2007

Enabling SMP Support for Erlang on Mac OS X

If you are working with Erlang on Mac OS X and installed it via MacPorts, you might not be running the SMP-enabled Erlang.

To check, start Erlang with:


erl -smp


If you get:


Argument '-smp' not supported.


then your Erlang was not compiled with SMP. All you'll need to do is:


sudo port uninstall erlang
sudo port install erlang +smp


Then when you run erl -smp, you'll be dropped right into the Erlang shell. When you run anything with multiple processes, you'll see both of your CPU cores active.

GRDDL Is Out, How To Integrate With SPARQL

GRDDL is out. It provides a mechanism for attaching instructions that convert documents on the web into RDF. In short, GRDDL allows you to link an XSLT transform to your XHTML page, which converts the XHTML into an RDF document. For more information, start at the GRDDL Primer.

(The irony is that while you can use XSLT to convert into RDF, you can't ever use XSLT to convert RDF into something else with complete certainty, because RDF/XML output is nondeterministic.)

The primer includes a few examples of using SPARQL to query the RDF document generated by a GRDDL transform. Here's an example from the primer:


PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rev: <http://www.purl.org/stuff/rev#>
PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT DISTINCT ?rating ?name ?region ?hotelname

FROM <http://www.w3.org/TR/grddl-primer/hotel-data.rdf>

WHERE {
?x rev:hasReview ?review;
vcard:ADR ?address;
vcard:FN ?hotelname .
?review rev:rating ?rating .
?address vcard:Locality ?region.

FILTER (?rating > "2").

?review rev:reviewer ?reviewer.
?reviewer foaf:name ?name;
foaf:homepage ?homepage

}


Looking at the FROM line, you see that we are referencing an RDF document by URI. However, if we are using GRDDL, that document doesn't exist until after we perform any transforms.

This means we can't use GRDDL directly in our SPARQL queries, as there isn't a physical RDF document to reference.

However, using the ever useful GRDDL Service, which is an online web service (lower case web service :) to generate RDF from documents using GRDDL, we could integrate GRDDL enabled documents directly into our SPARQL queries.

Let's replace the FROM clause in our original SPARQL query with a GRDDL Service URI that points directly at the source document (instead of the generated "middle man" RDF document).


PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rev: <http://www.purl.org/stuff/rev#>
PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT DISTINCT ?rating ?name ?region ?hotelname

FROM <http://www.w3.org/2007/08/grddl/?docAddr=http%3A%2F%2Fwww.w3.org%2FTR%2Fgrddl-primer%2Fhotel-data.html&output=rdfxml>

WHERE {
?x rev:hasReview ?review;
vcard:ADR ?address;
vcard:FN ?hotelname .
?review rev:rating ?rating .
?address vcard:Locality ?region.

FILTER (?rating > "2").

?review rev:reviewer ?reviewer.
?reviewer foaf:name ?name;
foaf:homepage ?homepage

}


There, now isn't that much more webby? I think the success of GRDDL lies with the integration into existing RDF toolkits. Otherwise, it's a two step process to get documents off the web, transformed into RDF, and then into RDF tools.
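
The service URI in the FROM clause above follows a simple pattern, so it can be built mechanically from any document address. Here is a minimal sketch in Erlang. The module and function names (grddl_uri, service_uri/1) are made up for illustration, the hand-rolled percent-encoder assumes a plain ASCII address, and the docAddr and output parameters are simply copied from the URI in the query above, not taken from any documentation of the service.

```erlang
-module(grddl_uri).
-export([service_uri/1]).

%% Base address of the GRDDL Service, copied from the query above.
-define(SERVICE, "http://www.w3.org/2007/08/grddl/").
%% ASCII code of the percent sign.
-define(PERCENT, 37).

%% Build the service URI for a GRDDL-enabled document address.
service_uri(DocAddr) ->
    ?SERVICE ++ "?docAddr=" ++ encode(DocAddr) ++ "&output=rdfxml".

%% Percent-encode everything except letters, digits, and - _ . ~
%% (assumes an ASCII string; not a general-purpose encoder).
encode([]) ->
    [];
encode([C | Rest]) when C >= $a, C =< $z;
                        C >= $A, C =< $Z;
                        C >= $0, C =< $9;
                        C =:= $-; C =:= $_; C =:= $.; C =:= $~ ->
    [C | encode(Rest)];
encode([C | Rest]) ->
    [?PERCENT, hex(C bsr 4), hex(C band 15) | encode(Rest)].

%% Map 0..15 to an uppercase hex digit.
hex(N) when N < 10 -> $0 + N;
hex(N) -> $A + N - 10.
```

Calling grddl_uri:service_uri("http://www.w3.org/TR/grddl-primer/hotel-data.html") gives back the same URI that appears in the FROM clause above.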

For XHTML documents, though, my money is with RDFa. I think linking XSLT to XHTML is just too complicated and brittle (hmm, rhymes with GRDDL) for the masses OR for the tools. RDFa at least lets me directly embed the markup inside my XHTML documents, which makes it much easier to change when I change the XHTML. Plus, as my tools will dynamically generate the XHTML (think template languages for web frameworks) I can easily embed the RDFa right into the templates. Plus, I'm already using CSS and CSS classes, which RDFa encourages, so I can ride off of that investment.

Monday, September 10, 2007

OLAP Cube Construction with Erlang

I've managed to build the parallel OLAP cube constructor in Erlang. This program achieves parallelization through creating a process for every dimension in the OLAP cube. Each process manages the file that holds the dimension data. Messages are passed from the first dimension all the way down to the last dimension which stores the measures themselves. To further parallelize things, you can partition any dimension using a modulus, which creates another file and process. This helps get around the 2 GB limit for dets tables.
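
Until I post the real thing, here is a minimal sketch of the shape described above, not the actual code. All names in it (dim_chain, the {row, ...} and {stored, ...} messages, add_row/3) are hypothetical, nothing is written to disk, and partition/2 uses a hash modulo N as a stand-in for however a dimension really gets partitioned.

```erlang
-module(dim_chain).
-export([start/1, add_row/3, partition/2]).

%% Spawn one process per dimension plus a final process for the measures.
%% Returns the pid of the first dimension in the chain.
start(DimNames) ->
    Store = spawn(fun() -> store_loop([]) end),
    lists:foldr(fun(Name, Next) ->
                        spawn(fun() -> dim_loop(Name, Next) end)
                end, Store, DimNames).

%% Each dimension consumes its own value and forwards the rest.
dim_loop(Name, Next) ->
    receive
        {row, [_Value | Rest], Measures, From} ->
            %% A real version would record _Value in this dimension's file.
            Next ! {row, Rest, Measures, From},
            dim_loop(Name, Next)
    end.

%% The end of the chain keeps the measures and acknowledges the sender.
store_loop(Rows) ->
    receive
        {row, [], Measures, From} ->
            From ! {stored, Measures},
            store_loop([Measures | Rows])
    end.

%% Send one row into the chain and wait for the acknowledgement.
add_row(First, DimValues, Measures) ->
    First ! {row, DimValues, Measures, self()},
    receive
        {stored, Measures} -> ok
    after 1000 ->
        timeout
    end.

%% Pick one of N partitions for a dimension value by hashing modulo N.
partition(Value, N) ->
    erlang:phash2(Value, N).
```

With First = dim_chain:start(["year", "region"]), a call such as dim_chain:add_row(First, [2007, "west"], [10, 20]) passes the row through both dimension processes before the measures are stored.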

I also have basic path based querying working, which is also parallelized through sending the query message through each dimension. While the querying itself isn't parallel for a particular client, it will theoretically scale to handle many clients.

When I move to more traditional querying that generates a traditional tabular result set, I will be able to parallelize the query for a single client.

I'll post the working code once I can choose a suitable license. I'm very interested to hear feedback, as I'm very much still an Erlang n00b.

Next up I'll generate some performance numbers to see if this thing will actually perform in the real world.

I have to say, functional programming is great when your solution is algorithmic. Previous implementations of mine were done in Java or Ruby, which are object oriented. The classes and objects obscured the algorithm, which in the case of OLAP cubes is the primary focus.

Sunday, September 9, 2007

Parallel OLAP Cube Construction

Still tilting at windmills here, with my quest to calculate an OLAP cube still trotting along. Lately I've been using Erlang to see if I can use its concurrent programming abilities to scale the OLAP cube generation. The initial progress is promising, with clear concurrency complete.

I hope to post the majority of the code soon, after I run a few more tests. For now, here's how I add two lists of numbers in Erlang. Please let me know if there's a better way.

Update:

The best way to do this is (thanks to Daniel Larsson):

lists:zipwith(fun(X,Y) -> X+Y end, [1,2],[3,4])

Old 'n busted way:

add_measures(OldMeasures, NewMeasures) ->
    add_measures(OldMeasures, NewMeasures, []).

add_measures([Old|OldRest], [New|NewRest], Accum) ->
    add_measures(OldRest, NewRest, [Old+New|Accum]);

add_measures([], [], Accum) -> lists:reverse(Accum).

Saturday, September 8, 2007

My Other Spring 2 Book is Finally Out

This is a blast from the past. Way back in January 2006 I wrote a chapter for what was then called Beginning Spring 2. I sent the chapter along in draft form for initial feedback. I never heard back, so I figured the author or publisher dropped the book.

Well, the other day I received a box of three copies of Building Spring 2 Enterprise Applications. And my name was squarely on the cover. Who knew?

First off, I want to thank whoever did the review and editing of the chapter. Those are tough jobs. Also thanks to the rest of the authors for pulling together to push the book out.

I'm still not sure what the story is behind the book or my chapter. I never had a chance to edit the chapter or review it before publication. The title has changed. The original author left the project. So I want to apologize if it doesn't make any sense or has errors.

I thumbed through the book, and all that Spring came crashing back to me. See, the secret is, ever since I wrote Expert Spring MVC and Web Flow, I have been using Ruby on Rails. So checking out this new Spring book has made me realize how much easier it is to write web applications in Rails than Spring. Don't get me wrong, the Spring Framework has the best engineered code I've ever seen (certainly much better than Rails core code), and I learned a lot about the right way to construct a framework. I still maintain that Spring is a better choice if you have to integrate with many different technologies (whether you like it or not). But as always, choose the right tool for the job.

That said, I'm happy Building Spring 2 Enterprise Applications finally made it out.

New Server Setup

Thanks to my buddy at Pau Spam, I'm now hosting this blog on my own dedicated box.  I had an old Shuttle PC, so I loaded it up with plenty of Hard Drive space, put Ubuntu Server (Ubuntu optimized for servers) on it, and now I feel free as a nerdy bird.

I want to give a shout out to Pau Spam.  They are a local Hawaiian company offering spam filtering as a service, among other services.  If you need to stop spam from ever seeing your inbox, contact Pau Spam.  Plus, they are really nice guys.

As I was moving my server over from my previous VPS, I realized how many little services or plugins I use.  Setup was a bit slower than I had anticipated, but in the end, fairly quick and easy.

So, a quick list of some of the services that I use:

  • EditDNS - for hosting my DNS.  I could do this myself, but they have a good interface and run multiple DNS servers.  And it's free.

  • PacNames - DNS name registration.  I don't know why I use them, I just always have.

  • Ubuntu Server - Ubuntu.  Server.  Easy.

  • Wordpress - Blog software

  • Feedburner - for tracking feed subscriptions.  Easy, free, and fun to watch the stats.

  • Feedburner Feedsmith plugin for Wordpress - links wordpress feeds to feedburner


OK, back to work!

Friday, September 7, 2007

Disclaimer

I'm probably required to say that the views expressed in this blog are my own, and do not necessarily reflect those of my employer.