Hadoop Disillusionment

It’s been quite a while since I last blogged. I’d like to say I’ll get more consistent, but things are so busy it’s hard to find the time. I felt the need to write something today, both to “get it off my chest,” as they say, and to maybe help others who are just starting out with Hadoop avoid some of the same misunderstandings.

The title of this post refers to my own disillusionment, not to the cliché “trough of disillusionment” from Gartner et al. I’ve been on the fringes of the Hadoop world for several years. I attended the first O’Reilly Strata Conference back in 2011, and I’ve read and read and read blog posts, watched many talks and tutorials, etc. I’ve even been working a bit with a production job that runs weekly on Amazon’s EMR service. But I’ve never really had to do a full-scale project that relied on Hadoop as its foundation. So I’ve developed some misunderstandings about how things work, and now those bubbles are being popped as I launch a Hadoop project.

Recently I pitched an idea for a project that I’ve been designing in my head for a couple of years and got it approved and funded. So I finally had my big data project that I could sink my teeth into and do a full-fledged Hadoop implementation. Up until this point EMR and Cascalog/Cascading had insulated me from the plumbing and details of Hadoop itself. I’m working with a really sharp team of four. We’re all Hadoop newbies, so we’re all climbing that learning curve together.

The old saying “be careful what you wish for” has hit me square in the face. I tweeted some comments over the last week or so that were probably pretty unfair to MapR in particular. I’ve come to see that the shortcomings I was complaining about are shortcomings in the Hadoop platform itself, not something MapR has done. If anything, they have added a number of simplifications and created sane defaults where Hadoop itself has missed the mark. In fact, MapR has reached out to us and is actively helping us get things working.

So, since 140 characters at a time isn’t enough to get my point across and has caused more misunderstanding than it’s helped, I decided to spell things out in longer form and maybe even add my voice to others’. It’s going to take a lot of us to get the Hadoop ship turned around.

Over the next few blog posts I intend to take you on the journey as I go from Hadoop neophyte to disillusioned newbie. Along the way I welcome corrections where I may be wrong or off the mark. Watch for the first post within the next day on my top five current complaints:

  1. Out of control configuration options (aka XML sucks)
  2. Inability to do development in a Windows environment (unfortunately, not everyone is on Linux or Mac yet)
  3. Reliance on shell scripts for everything (we’re writing Java apps, not Bash scripts)
  4. Out of date and incomplete documentation (what’s out there is all the same and misses some crucial things)
  5. A really, really nasty looking code base (100+ line methods, shell out to OS, oh my God!)

There, that should be provocative enough to get you to come back… Some other things related to this project that I may or may not blog about are why I think AWS EMR is not a serious platform for more than the occasional one-off job, and our experiences implementing ideas from Nathan Marz’s still-incomplete book “Big Data”.

Cascalog Performance Tuning (or avoiding Java reflection at all costs)

Recently I’ve been working on a project both to learn Hadoop and to do a proof of concept using it for some data processing. The most I’d done with Hadoop before this was a couple of tutorials using word count examples. This project called for much more processing, and the prospect of trying to figure out how to do it in straight MapReduce was daunting.

I was aware of Cascading, and since all of our work is done in Java these days I decided to give that a try. The first prototype was developed using Cascading and proved fairly successful, though the code was kind of a mess, as Java code of any size tends to be. Since I’ve done a little bit of programming in Clojure over the last year or so I knew of Cascalog, so I decided to give it a try and see how it would compare with doing it ourselves in Cascading.

Let me summarize the project first then I’ll share what I learned in the process.

The project is to take data in the form of pipe-delimited text files that come in every night from about 2,000 individual stores, compute some aggregated stats and populate an Oracle database table with the results. Each line in a text file is an individual item sold at a store, and each file contains all sales at that store for the last seven years or so. The Hadoop job needs to run over all the transactions in all of the files to calculate the totals. (I realize there are better ways to do this by only pulling and processing the updated data each day, but this is more of a learning project and an evaluation of Hadoop, Cascading and Cascalog.) All in all there are about 480 million lines across the files.
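
To give a sense of the shape of the job, here’s a rough sketch of this kind of aggregation written directly against Cascading in Java. This isn’t our actual code; the field names, file layout and paths are assumptions for illustration, and in the real project the results go on to an Oracle table rather than a text sink.

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.pipe.assembly.SumBy;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class StoreSalesTotals {
  public static void main(String[] args) {
    // Assumed line layout: store_id|item_id|quantity|amount (one line per item sold)
    Fields salesFields = new Fields("store_id", "item_id", "quantity", "amount");
    Tap source = new Hfs(new TextDelimited(salesFields, "|"), args[0]);
    Tap sink = new Hfs(new TextDelimited(Fields.ALL, "|"), args[1], SinkMode.REPLACE);

    // Group every line by store and sum the sale amounts into a per-store total
    Pipe totals = new Pipe("store-totals");
    totals = new SumBy(totals, new Fields("store_id"),
        new Fields("amount"), new Fields("total_sales"), double.class);

    FlowDef flowDef = FlowDef.flowDef()
        .setName("store-sales-totals")
        .addSource(totals, source)
        .addTailSink(totals, sink);

    new HadoopFlowConnector().connect(flowDef).complete();
  }
}

In Cascalog the same aggregation is expressed as a query instead of wiring the pipes up by hand, which is what drew us to it in the first place.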


Avoiding data processing with hashes

Recently, for a project I’m working on, we had to find a way to decide whether a data record had changed since the last time we saved it to the database. For this project we receive data from a number of customer databases periodically (every six hours) and need to do some ETL work on it before storing it in an Oracle database. We all know that reading a record from Oracle can be slow (relatively speaking), so we wanted something much faster.

Since there isn’t a reasonable way to determine from the databases themselves which records have changed, we’re forced to dump entire tables for processing. To save time, we wanted to do the minimal amount of work for records that haven’t changed since we last wrote them to the database. Typically, only a very small number of rows in each of these tables have actually changed.

The solution we came up with was to calculate a hash value from the concatenated column values for each record when we write it to Oracle. These hash values are then stored in Redis, keyed by the primary key of the table. When we receive a data table dump from the remote database we calculate the same hash on each incoming record and compare it to the hash value we have stored in Redis. If they match, we know the record hasn’t changed and we don’t need to process it.

The whole thing is very fast and saves a great deal of unnecessary processing downstream. So far Redis has proven up to the task of quickly storing and retrieving millions of key/hash values.

Here’s what it looks like:

Let’s say we have a customer record in our database that has id, name, address, city and state like this:

id: 1000, name: Joe Smith, address: 123 Main St., city: New York, state: NY

The first time we see this record we check Redis for the key 1000 and get nil back, so we store the record in the database, calculate the hash of the value "1000:Joe Smith:123 Main St.:New York:NY" to get something like "0a243fec5ab" (we are using 128-bit hash values in production) and then store that hash value in Redis using

set "1000" "0a243fec5ab"

The next time we receive this record we calculate the hash again; if the record is unchanged the hash will match and we know we don’t need to process it again. If the hash value is different, then we know the record has changed and we need to send it through the rest of the processing chain.

We are currently using Storm to process millions of records at a time and run this hash check very early in the pipeline, so unchanged records don’t consume resources further downstream. We chose the Murmur3 128-bit hash included in the Google Guava library for its speed.
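
Here’s a minimal Java sketch of what that check looks like, using Guava’s Murmur3 128-bit hash and the Jedis client for Redis. The class and method names are made up for illustration, and the real code concatenates whatever columns the particular table has:

import com.google.common.base.Joiner;
import com.google.common.hash.HashFunction;
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;
import redis.clients.jedis.Jedis;

public class ChangeDetector {
  private static final HashFunction MURMUR = Hashing.murmur3_128();
  private final Jedis redis;

  public ChangeDetector(Jedis redis) {
    this.redis = redis;
  }

  // Returns true if the record is new or has changed since we last stored it.
  public boolean hasChanged(String primaryKey, String... columnValues) {
    // Concatenate the key and columns the same way every time so the hash is stable
    String concatenated = primaryKey + ":" + Joiner.on(":").join(columnValues);
    String newHash = MURMUR.hashString(concatenated, StandardCharsets.UTF_8).toString();

    String storedHash = redis.get(primaryKey);
    if (newHash.equals(storedHash)) {
      return false;                   // unchanged -- skip the rest of the pipeline
    }
    redis.set(primaryKey, newHash);   // remember the fingerprint for next time
    return true;                      // new or changed -- send it on for processing
  }
}

In the real pipeline this check runs in one of the earliest steps of the Storm topology, so unchanged records are dropped before they cost anything downstream.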

Have you ever had a similar requirement? How did you solve it?

Amazon SQS vs. RabbitMQ

Recently at work we’ve been selecting a message queue system for a new project. The three we looked at were ActiveMQ, Amazon SQS and RabbitMQ. We’ll be running the applications on Amazon EC2, so SQS seemed like just the thing. Unfortunately, it performed so poorly that in the end it wasn’t even a contest.

We threw out ActiveMQ pretty early, before even running it through the performance tests, due to its horrible clustering support. It seems to be geared more toward load balancing than high availability and fast failure recovery, and we really didn’t think it met our HA requirements.

So I set out to do a performance test between Amazon SQS and RabbitMQ to see if they were close. I was really hoping that SQS would work, since the setup and maintenance time would be next to nothing. We set up two EC2 small instances and installed RabbitMQ on one. The other would be used to run the Java program that ran the tests.

The idea was that when the program started up it would create two threads. One thread would publish 10,000 messages to the queue as fast as it could, one after the other. The other thread would consume the 10,000 messages from the queue as fast as it could. The results were astonishing: running against SQS took a little over 6 minutes, while RabbitMQ took 12 seconds. I suspected that RabbitMQ would be faster, but not that much faster. Our decision was easy.
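
For the curious, the RabbitMQ half of that test can be sketched with the RabbitMQ Java client roughly like this. This is not the actual benchmark code (that’s linked below); the host argument, queue name and message payload are placeholders:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.GetResponse;

public class RabbitBench {
  private static final int MESSAGE_COUNT = 10_000;
  private static final String QUEUE = "bench";

  public static void main(String[] args) throws Exception {
    ConnectionFactory factory = new ConnectionFactory();
    factory.setHost(args.length > 0 ? args[0] : "localhost");
    Connection connection = factory.newConnection();

    // Producer thread: publish 10,000 messages back to back
    Thread producer = new Thread(() -> {
      try (Channel channel = connection.createChannel()) {
        channel.queueDeclare(QUEUE, false, false, false, null);
        for (int i = 0; i < MESSAGE_COUNT; i++) {
          channel.basicPublish("", QUEUE, null, ("message " + i).getBytes());
        }
      } catch (Exception e) { throw new RuntimeException(e); }
    });

    // Consumer thread: poll the queue until all 10,000 messages have been read
    Thread consumer = new Thread(() -> {
      try (Channel channel = connection.createChannel()) {
        channel.queueDeclare(QUEUE, false, false, false, null);
        int received = 0;
        while (received < MESSAGE_COUNT) {
          GetResponse response = channel.basicGet(QUEUE, true); // auto-ack
          if (response != null) received++;
        }
      } catch (Exception e) { throw new RuntimeException(e); }
    });

    long start = System.currentTimeMillis();
    producer.start();
    consumer.start();
    producer.join();
    consumer.join();
    System.out.println("Elapsed ms: " + (System.currentTimeMillis() - start));

    connection.close();
  }
}

The SQS side of the test has the same two-thread shape, just with the AWS SDK’s send and receive calls in place of the channel operations.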

I realize this isn’t a scientific benchmark, but the vast difference in results speaks for itself, and a carefully constructed experiment isn’t needed. I uploaded the code I used to GitHub: https://github.com/dkincaid/MQBenchmark in case you’re interested in trying it out. Please understand it was a quick program thrown together for a single purpose; I fully expected to throw it away when I was done.

Strangeloop 2011 Day 2

I’m headed back home from Strangeloop 2011 this morning. Once again I booked an early flight, so I was up at 4:45 to get to the airport (when will I learn?). The conference was a smashing success as far as I’m concerned. It was extremely well run and the talks were full of solid content. I didn’t see nearly as much marketing during the conference as I’ve seen at other conferences, which was really nice. Most of the marketing I did see was companies trying to recruit new developers. There seems to be a lot of demand out there right now for innovative thinkers and people who are eager to stay on the cutting edge. Makes me think…

I started the day with a talk by Jake Luciani called “Hadoop and Cassandra”. Basically this was an introduction to a tool called Brisk, which helps take some of the pain out of bringing up Hadoop clusters and running MapReduce jobs. In essence it embeds the components of Hadoop inside Cassandra and makes it easy to deploy and easy to scale with no downtime. It replaces HDFS with CassandraFS, which in and of itself looks really interesting: it turns the Cassandra DB into a distributed file system. Very interesting how they are doing that. Sounds like a topic for another post once I’ve had some time to read more about it. Jake showed a demo that looked quite impressive as he brought up a four-node Hadoop-on-Cassandra cluster and ran a portfolio manager application, splitting it into an OLTP side and an OLAP side. Brisk definitely deserves further investigation.


Strangeloop 2011 – Day 1 Debrief

I’m at Strangeloop 2011 in St. Louis. The workshops were yesterday, so I’m calling that day 0. I don’t really have much to say about the workshops other than I should have chosen different ones. I chose the two Clojure workshops to try to learn more about that language. I’ve been working with Clojure on and off for the last 6 months, but don’t feel I’ve really grasped the fundamentals of the language or how to think in it. While I did get a few new things from Stuart Sierra’s part 1 (Introduction to Clojure), I probably would have gotten more out of a different workshop. Stuart did a fantastic job presenting and it was an intro workshop, so it is completely my fault.

The second workshop was Aaron Bedra’s “Building Analytics with Clojure”. This wasn’t really about analytics at all unless your idea of analytics is making scatterplots and bar charts from a data set. I was expecting to learn much more about Incanter and how it can be used in similar ways to R. I must have misunderstood the topic of the workshop. I should have gone to the Cascalog workshop.

Today was much, much better. I came up to the room after lunch for a bit and was thinking that I had already gotten my money’s worth out of the first half day. I went to some amazing talks.

I haven’t been to many developer/tech conferences, so I don’t really have much to compare this to. I was at O’Reilly’s Strataconf in February and was a bit disappointed in the amount of actual content contained in most of the talks. The keynotes there were 15 minutes and most were sales pitches for the various sponsors. The talks here are nothing but great content. The team did a fantastic job lining up a great set of talks and I’m learning a ton.


Getting data from the Infochimps Geo API in R

I am very intrigued by the Infochimps Geo API, so I wanted to play around with it a little bit and pull the data into R. I’ll start by getting data from the American Community Survey Topline API for a 10km area around where I live.

First, some setup code. It loads a couple of libraries we’ll need (RJSONIO and ggplot2), then sets up some variables we’ll use later to construct the REST call to the Infochimps Geo API.

library(RJSONIO)   # for parsing the JSON that the API returns
library(ggplot2)   # for plotting later on

# Pieces of the REST call
api.uri <- "http://api.infochimps.com/"
acs.topline <- "social/demographics/us_census/topline/search?"
api.key <- "apikey=xxxxxxxxxx"

# Search area: a 10km radius around home
radius <- 10000  # in meters
lat <- 44.768202
long <- -91.491603

# Columns we want back from the ACS Topline dataset
columns <- c("geography_name", "median_household_income",
             "median_housing_value", "avg_household_size")

Note: if you want to use this code you’ll need to replace the x’s in api.key with your own Infochimps API key.
