Showing posts from September, 2007

When will Google let us run our own Map/Reduce programs?

Reading about Sam Ruby's issues with JSON for Map/Reduce got me thinking. How long before Google will let us run our own Map/Reduce programs on their clusters?

We all know that one of the best ways to scale is to push the operation to the data. And who has all the data? Why, that's Google, of course. They are in a perfect position to host and run applications which map/reduce the web for individuals or organizations.

Imagine that I want to start collecting results by microformats such as hCard. It would be nice to formulate a query which understands microformats as a map/reduce and send it off to Google.
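To make the idea concrete, here is a minimal sketch of what such a query might look like as a map step and a reduce step. Everything in it is an assumption for illustration: the module name hcard_count, the {Url, Html} page tuples, and the crude regular expression that treats any class attribute containing "vcard" as an hCard. It runs locally over a list of pages and says nothing about how Google would actually expose Map/Reduce.

```erlang
-module(hcard_count).
-export([map_page/1, reduce_counts/1, run/1]).

%% Map step (hypothetical): for one {Url, Html} page, emit {Url, N},
%% where N is the number of class attributes that contain the hCard
%% root class name "vcard". This is a rough heuristic, not a parser.
map_page({Url, Html}) ->
    Pattern = "class=\"[^\"]*\\bvcard\\b",
    case re:run(Html, Pattern, [global]) of
        {match, Matches} -> {Url, length(Matches)};
        nomatch -> {Url, 0}
    end.

%% Reduce step: sum the per-page counts into a single total.
reduce_counts(Pairs) ->
    lists:sum([N || {_Url, N} <- Pairs]).

%% Run the map step over every page, then reduce the results.
run(Pages) ->
    reduce_counts(lists:map(fun map_page/1, Pages)).
```

Calling hcard_count:run/1 with a list of {Url, Html} tuples returns the total number of hCard-looking elements across those pages.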

Facebook got this right by creating a Platform. Developers can write applications which link directly into Facebook. I want to do the same with Google.

Semantic Web Use Case #32354343

I should be able to tell Flickr to let any of my Facebook friends view my photos.

links for 2007-09-25

I Second That Emotion

So Tim Bray finds out that Erlang IO is slow. I can attest to this: my recent work on reading large files in Erlang has shown that IO and string manipulation are much slower than I would have wanted.

Yes, like Bray, my file reading is single threaded (although, what I do with the line is very multi-threaded) so I suppose using a single thread for Erlang isn't very Erlang-like in the first place.

In the meantime, I'm porting my OLAP cube generator to Scala. The assumption (and shortly, hopefully proof) is that the JVM can do file IO much better than Erlang, yet I can still take advantage of Scala's Actors to retain my concurrency.

Update: OK, some numbers and code. This is a benchmark for Erlang and Scala to read in a file line by line.

First, the Erlang code:

process_file2(Filename) ->
    {ok, File} = file:open(Filename, read),

process_lines2(File) ->
    case io:get_line(File, '') of
        eof -> file:close(File);
        _ -> process_li…


7 reasons I switched back to PHP after 2 years on Rails


Now Using OpenDNS

I've just configured my home router to use OpenDNS. It's a pretty cool service. OpenDNS is a suite of value-added DNS services, like phishing detection and fine-grained control over what resolves.

I made the change because, ever since I moved to DSL, my DNS lookups have become very slow. It's not the kind of thing you notice until, well, you notice that it's too slow. Switching to OpenDNS fixed that problem. Now I don't notice the DNS lookups and everything continues to function.

links for 2007-09-21

No SMP Erlang for Windows?

I was trying to enable SMP on my Windows Erlang install, but I was shocked to find out that SMP is not supported in the official Windows Erlang distribution.

I confirmed this by reading the Errata for Programming Erlang:

#8893: "SMP Erlang has been enabled ... since R11B-0 on all platforms where SMP Erlang is known to work."
PLEASE mention that Windows is NOT among them.

What the heck? That seems like a horrible little secret of the Erlang world. I don't want to compile my own Erlang on Windows just to get SMP (which, imo, is the killer feature of Erlang).

Semantic Web Doesn’t Have to Be Difficult

After reading Semantic Web: Difficulties with the Classic Approach, I am even more certain that we're putting too many expectations on the semantic web. The semantic web doesn't have to be difficult to build or use. It simply starts with resetting expectations and re-branding.

To start, the semantic web needs to be re-branded as the Data Web. Now take a deep breath. Doesn't the air feel lighter and taste sweeter? That's because the heaviness of the baggage brought along by the word "semantic" is gone. People see semantic and go all screwy: "Replace humans with computers?" or "How do you deal with uncertainty?" or "How do we agree on what we mean by agreement?" or "A.I. never worked."

Even Tim B.L. thinks that the name "semantic web" isn't very good:

I don't think it's a very good name but we're stuck with it now. The word semantics is used by different groups to mean different things. But n…

Blog, Meet SliceHost

My little Shuttle PC, which serves this blog, has been acting strange and just hanging up. A reboot fixes the problem, but a day or two later, *poof*, it's gone off the network.

Instead of tinkering around with the hardware to find out what's wrong, I just moved to SliceHost. They offer Linux VPS servers at very reasonable prices. I couldn't find another host that offered the same specs at a lower price.

A bonus of moving to SliceHost is that I'll get a lot more bandwidth.


Joel points out:

... Ajax apps can be inconsistent, and have a lot of trouble working together — you can’t really cut and paste objects from one Ajax app to another, for example, so I’m not sure how you get a picture from Gmail to Flickr. Come on guys, Cut and Paste was invented 25 years ago.

Beautiful Programs

If you've been reading Beautiful Code, and I hope you have, you should also read Donald Knuth's Computer Programming as an Art. It actually makes you feel better about writing all that code and always trying to get it right.

The possibility of writing beautiful programs, even in assembly language, is what got me hooked on programming in the first place.

links for 2007-09-14

Tail Recursive Line Processing in Erlang

When I was testing performance in Erlang using funs, I mentioned I wanted to see what would happen if I made the functions tail recursive. I just took a crack at it, and it looks like making this particular function tail recursive didn't help performance. (As always, please let me know if my assumptions are incorrect.)

Tail Recursive Line Processing

process_file(Filename) ->
    {ok, File} = file:open(Filename, read),
    process_lines(File, first, 0).

%% Tokenize a line: the first 10 tokens are dimensions, the rest are
%% converted to integer measures.
process_line(eof) -> done;
process_line(Line) ->
    L = string:tokens(string:strip(Line, both, $\n), " "),
    {_Dimensions, Measures} = lists:split(10, L),
    lists:map(fun(X) -> {I,_} = string:to_integer(X), I end, Measures).

%% Tail-recursive read loop: the recursive call is the last expression.
process_lines(File, eof, _) -> file:close(File), done;
process_lines(File, _PreviousLine, LineNum) ->
    Line = io:get_line(File, ''),
    process_lines(File, Line, LineNum + 1).

Running this code against a file of 10037355 lines and 1028071833 bytes takes, on average, 401.40 seconds. This is only sl…

Erlang Fun Results

As I code more and more in Erlang, I'm very interested in how to achieve the best performance possible. I'm very new to the language, so I'm still unsure what the best practices are concerning performance. Tonight I was testing the effects of using a fun() in an algorithm.

My test concerns reading a tab-delimited text file, tokenizing it, and converting any numbers into integers. I've split the program into two conceptual parts: 1) the file IO and line reading, and 2) the handling of the line. I wanted to test the performance differences between using a fun() for the line handling vs just including the line handling code directly.

My test text file is 10037355 lines long and 1028071833 bytes big. I compiled my code using HiPE.
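For a rough idea of the two variants being compared, here is a minimal sketch. It is not the benchmark code itself: the module name line_tokens, the function names, and the choice to leave non-numeric tokens as strings are all assumptions made for illustration.

```erlang
-module(line_tokens).
-export([parse_line/1, parse_with_fun/2, to_int_or_keep/1]).

%% Direct version: tokenize on tabs and convert each token inline.
parse_line(Line) ->
    Tokens = string:tokens(string:strip(Line, both, $\n), "\t"),
    [to_int_or_keep(T) || T <- Tokens].

%% fun() version: same tokenizing, but the per-token handling is
%% passed in as a fun and applied with lists:map/2.
parse_with_fun(Line, Handler) ->
    Tokens = string:tokens(string:strip(Line, both, $\n), "\t"),
    lists:map(Handler, Tokens).

%% Convert a token to an integer when the whole token is numeric;
%% otherwise return the token unchanged.
to_int_or_keep(Token) ->
    case string:to_integer(Token) of
        {Int, []} -> Int;
        _ -> Token
    end.
```

Both line_tokens:parse_line(Line) and line_tokens:parse_with_fun(Line, fun line_tokens:to_int_or_keep/1) should give the same list for the same line, so any timing difference comes from the extra fun call.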

The quick answer is, using a fun() is slightly slower than not using it (which is to be expected). For my particular test, using a fun() was approximately 20% slower.

test             Run 1 (sec)    Run 2 (sec)
without fun()    404.017        403.632


Enabling SMP Support for Erlang on Mac OS X

If you are working with Erlang on Mac OS X and have installed it via MacPorts, then you might not be running the SMP-enabled Erlang.

To check, start Erlang as

erl -smp

If you get:

Argument '-smp' not supported.

Then your Erlang was not compiled with SMP. All you'll need to do is:

sudo port uninstall erlang
sudo port install erlang +smp

Then when you run erl -smp, you'll be dropped right into the Erlang shell. When you run anything with multiple processes, you'll see both of your CPU cores active.
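Another way to confirm the result is to ask the running emulator directly. This is a small sketch; the module name smp_check and the function report/0 are made up for illustration, while erlang:system_info/1 with the smp_support and schedulers items is part of the standard runtime.

```erlang
-module(smp_check).
-export([report/0]).

%% Print and return whether this emulator was built with SMP support
%% and how many scheduler threads it is running.
report() ->
    Smp = erlang:system_info(smp_support),
    Schedulers = erlang:system_info(schedulers),
    io:format("SMP support: ~p, scheduler threads: ~p~n", [Smp, Schedulers]),
    {Smp, Schedulers}.
```

If the first element of the returned tuple is false, the emulator was built without SMP and only one core will be used.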

GRDDL Is Out, How To Integrate With SPARQL

GRDDL is out, providing a mechanism for attaching instructions that convert documents on the web into RDF. In short, GRDDL allows you to link an XSLT transform to your XHTML page, which converts the XHTML into an RDF document. For more information, start at the GRDDL Primer.

(The irony is that while you can use XSLT to convert into RDF, you can't ever use XSLT to convert RDF into something else with complete certainty because RDF/XML output is nondeterministic.)

The primer includes a few examples of using SPARQL to query the RDF document generated by a GRDDL transform. Here's an example from the primer:

PREFIX rdfs: <>
PREFIX rev: <>
PREFIX vcard: <>
PREFIX dc: <>
PREFIX foaf: <>

SELECT DISTINCT ?rating ?name ?region ?hotelname


?x rev…

OLAP Cube Construction with Erlang

I've managed to build the parallel OLAP cube constructor in Erlang. This program achieves parallelization through creating a process for every dimension in the OLAP cube. Each process manages the file that holds the dimension data. Messages are passed from the first dimension all the way down to the last dimension which stores the measures themselves. To further parallelize things, you can partition any dimension using a modulus, which creates another file and process. This helps get around the 2 GB limit for dets tables.
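The partitioning step can be sketched roughly as follows. This is not the cube constructor's code: the module name dim_partition and both functions are hypothetical, and hashing non-integer keys with erlang:phash2/2 is my own stand-in for whatever the real keys need. Each resulting partition would then get its own file and process.

```erlang
-module(dim_partition).
-export([partition_of/2, group_by_partition/2]).

%% Pick a partition index in 0..NumPartitions-1 for a dimension key.
%% Non-negative integer keys use a plain modulus; any other term is
%% hashed into the same range (an assumption, not the original design).
partition_of(Key, NumPartitions) when is_integer(Key), Key >= 0 ->
    Key rem NumPartitions;
partition_of(Key, NumPartitions) ->
    erlang:phash2(Key, NumPartitions).

%% Group keys by partition, returning [{Partition, Keys}] sorted by
%% partition index, with keys in their original order.
group_by_partition(Keys, NumPartitions) ->
    Tagged = [{partition_of(K, NumPartitions), K} || K <- Keys],
    [{P, [K || {P1, K} <- Tagged, P1 =:= P]}
     || P <- lists:usort([P0 || {P0, _} <- Tagged])].
```

With two partitions, even integer keys land in partition 0 and odd keys in partition 1, so each dets file holds roughly half of the dimension's data.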

I also have basic path-based querying working, which is also parallelized through sending the query message through each dimension. While the querying itself isn't parallel for a particular client, it will theoretically scale to handle many clients.

When I move to more traditional querying that generates a traditional tabular result set, I will be able to parallelize the query for a single client.

I'll post the working code once I can choose a suitable li…

Parallel OLAP Cube Construction

Still tilting at windmills here, with my quest to calculate an OLAP cube still trotting along. Lately I've been using Erlang to see if I can use its concurrent programming abilities to scale the OLAP cube generation. The initial progress is promising, with clear concurrency complete.

I hope to post the majority of the code soon, after I run a few more tests. For now, here's how I add two lists of numbers in Erlang. Please let me know if there's a better way.


The best way to do this is (thanks to Daniel Larsson):

lists:zipwith(fun(X,Y) -> X+Y end, [1,2],[3,4])

Old 'n busted way:

add_measures(OldMeasures, NewMeasures) ->
    add_measures(OldMeasures, NewMeasures, []).

add_measures([Old|OldRest], [New|NewRest], Accum) ->
    add_measures(OldRest, NewRest, [Old+New|Accum]);

add_measures([], [], Accum) -> lists:reverse(Accum).

My Other Spring 2 Book is Finally Out

This is a blast from the past. Way back in January 2006 I wrote a chapter for what was then called Beginning Spring 2. I sent the chapter along in draft form for initial feedback. I never heard back, so I figured the author or publisher dropped the book.

Well, the other day I received a box of three copies of Building Spring 2 Enterprise Applications. And my name was squarely on the cover. Who knew?

First off, I want to thank whoever did the review and editing of the chapter. Those are tough jobs. Also thanks to the rest of the authors for pulling together to push the book out.

I'm still not sure what the story is behind the book or my chapter. I never had a chance to edit the chapter or review it before publication. The title has changed. The original author left the project. So I want to apologize if it doesn't make any sense or has errors.

I thumbed through the book, and all that Spring came crashing back to me. See, the secret is, ever since I wrote Expert Spring M…

New Server Setup

Thanks to my buddy at Pau Spam, I'm now hosting this blog on my own dedicated box. I had an old Shuttle PC, so I loaded it up with plenty of hard drive space, put Ubuntu Server (Ubuntu optimized for servers) on it, and now I feel free as a nerdy bird.

I want to give a shout out to Pau Spam.  They are a local Hawaiian company offering spam filtering as a service, among other services.  If you need to stop spam from ever seeing your inbox, contact Pau Spam.  Plus, they are really nice guys.

As I was moving my server over from my previous VPS, I realized how many little services or plugins I use.  Setup was a bit slower than I had anticipated, but in the end, fairly quick and easy.

So, a quick list of some of the services that I use:

EditDNS - for hosting my DNS.  I could do this myself, but they have a good interface and run multiple DNS servers.  And it's free.
PacNames - DNS name registration.  I don't know why I use them, I just always have.
Ubuntu Server - Ubuntu.  Server.  E…

links for 2007-09-08

Task-Centered User Interface Design : Main page and Shareware Notice
The central goal of this book is to teach the reader how to design user interfaces that will enable people to learn computer systems quickly and use them effectively, efficiently, and comfortably.
(tags: ui usability)

links for 2007-09-05

links for 2007-09-04