Cascalog Performance Tuning (or avoiding Java reflection at all costs)

Recently I’ve been working on a project to both learn Hadoop and build a proof of concept that uses it for some data processing. Until now, the most I’d done with Hadoop was a couple of tutorials using word count examples. This project called for much more processing, and the prospect of figuring out how to do that in straight MapReduce was daunting.

I was aware of Cascading, and since all of our work is done in Java these days, I decided to give that a try. The first prototype was developed using Cascading and proved fairly successful. The code was kind of a mess, though, as Java code of any size tends to be. Since I have done a little programming in Clojure over the last year or so, I knew of Cascalog and decided to see how it would compare with doing it ourselves in Cascading.

Let me summarize the project first, then I’ll share what I learned in the process.

The project is to take data in the form of pipe-delimited text files that come in every night from about 2,000 individual stores, compute some aggregated stats, and populate an Oracle database table with the results. Each line in the text files is an individual item sold at the store, and the files contain all sales at the store for the last seven years or so. The Hadoop job needs to run over all the transactions in all of the files to calculate the totals. (I realize there are better ways to do this by only getting and processing updated data each day, but this is more of a learning project and an evaluation of Hadoop, Cascading and Cascalog.) All in all there are about 480 million lines across all of the files.
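To make the shape of the job concrete, here is a minimal single-machine sketch in Python of the kind of per-store aggregation described above. It is not the Cascading or Cascalog code from the project, and the record layout is an assumption: the real files' fields aren't listed here, so the three columns (store id, item, sale amount) and the function name `total_sales_by_store` are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal


def total_sales_by_store(lines):
    """Sum the sale amount per store from pipe-delimited lines.

    Assumed (hypothetical) layout: store_id|item|amount
    """
    totals = defaultdict(Decimal)  # a missing store starts at Decimal('0')
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        store_id, _item, amount = line.split("|")
        totals[store_id] += Decimal(amount)
    return dict(totals)


# Made-up sample records, one item sold per line.
sample = [
    "0001|widget|2.50",
    "0001|gadget|1.25",
    "0002|widget|2.50",
]
totals = total_sales_by_store(sample)
```

The same group-by-store-and-sum logic is what the Hadoop job has to spread across the cluster; the hard part at 480 million lines is the distribution, not the arithmetic.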

Strangeloop 2011 Day 2

I’m headed back home from Strangeloop 2011 this morning. Once again I booked an early flight, so I was up at 4:45 to get to the airport (when will I learn?). The conference was a smashing success as far as I am concerned. It was extremely well run and the talks were full of solid content. I didn’t see nearly as much marketing during the conference as I’ve seen at other conferences, which was really nice. Most of the marketing I did see was from companies trying to recruit new developers. There seems to be a lot of demand out there right now for innovative thinkers and people who are eager to stay on the cutting edge. Makes me think…

I started the day with a talk by Jake Luciani called “Hadoop and Cassandra”. Basically this was an introduction to a tool called Brisk, which helps take some of the pain out of bringing up Hadoop clusters and running MapReduce jobs. In essence it embeds the components of Hadoop inside Cassandra and makes it easy to deploy and easy to scale with no downtime. It replaces HDFS with CassandraFS, which in and of itself looks really interesting: it turns the Cassandra DB into a distributed file system. How they are doing that sounds like a topic for another post once I’ve had some time to read more about it. Jake showed a demo that looked quite impressive as he brought up a four-node Hadoop on Cassandra cluster and ran a portfolio manager application, splitting it into an OLTP side and an OLAP side. Brisk definitely deserves further investigation.

Strangeloop 2011 – Day 1 Debrief

I’m at Strangeloop 2011 in St. Louis. The workshops were yesterday, so I’m calling that day 0. I don’t really have much to say about the workshops other than I should have chosen different ones. I chose the two Clojure workshops to try to learn more about that language. I’ve been working with Clojure on and off for the last 6 months, but don’t feel I’ve really grasped the fundamentals of the language or how to think in it. While I did get a few new things from Stuart Sierra’s part 1 (Introduction to Clojure), I probably would have gotten more out of a different workshop. Stuart did a fantastic job presenting and it was an intro workshop, so it is completely my fault.

The second workshop was Aaron Bedra’s “Building Analytics with Clojure”. This wasn’t really about analytics at all unless your idea of analytics is making scatterplots and bar charts from a data set. I was expecting to learn much more about Incanter and how it can be used in similar ways to R. I must have misunderstood the topic of the workshop. I should have gone to the Cascalog workshop.

Today was much, much better. I came up to the room after lunch for a bit and was thinking that I had already gotten my money’s worth out of the first half day. I went to some amazing talks.

I haven’t been to many developer/tech conferences, so I don’t really have much to compare this to. I was at O’Reilly’s Strataconf in February and was a bit disappointed in the amount of actual content contained in most of the talks. The keynotes there were 15 minutes and most were sales pitches for the various sponsors. The talks here are nothing but great content. The team did a fantastic job lining up a great set of talks and I’m learning a ton.
