Avoiding data processing with hashes

Recently for a project I am working on we had to find a way to decide if a data record had changed since the last time we saved it to the database. For this project we are receiving data from a number of customer databases periodically (every 6 hours) and need to do some ETL stuff on it before storing it in an Oracle database. We all know that reading a record from Oracle can be slow (relatively speaking), so we wanted something much faster.

Since there isn’t a reasonable way to determine from the databases themselves which records have changed we are forced to dump entire tables for processing. We wanted to do the minimal amount of processing necessary if the records haven’t changed since we wrote them to the database to save processing time. Typically for most of these tables only a very small number of rows have actually changed.

The solution we came up with was to calculate a hash value from the concatenated column values for each record when we write a record to Oracle. These hash values are then stored in a Redis hash keyed by the primary key of the table. Then when we receive a data table dump from the remote database we calculate the same hash on the incoming record and compare it to the hash value we have stored in Redis. If they match we know the record hasn’t changed and we don’t need to process it.

The whole thing is very fast and saves a great deal of unnecessary processing downstream. So far Redis has proven up to the task of quickly storing and retrieving millions of key/hash values.

Here’s what it looks like:

Let’s say we have a customer record in our database that has id, name, address, city and state like this:

id: 1000, name: Joe Smith, address: 123 Main St., city: New York, state: NY

the first time we see this record and check Redis for the existence of the key 1000 we will get nil returned, so we store the record in the database, calculate the hash of the value "1000:Joe Smith:123 Main St.:New York:NY" to get something like "0a243fec5ab" (we are using 128-bit hash values in production) and then store that hash value in Redis using

set "1000" "0a243fec5ab"

the next time we get this record we calculate the same hash if it is unchanged and know we don’t need to process it again. If the hash value is different then we know that the record has changed and we need to send it through the rest of the processing chain.

We are currently using Storm to process millions of records at a time and use this hash check very early on in the pipeline to save us from needing more resources later in the pipeline to process unchanged records. We chose to use the Murmur-128 hash that is included in the Google Guava library for its speed.

Have you ever had a similar requirement? How did you solve it?

Amazon SQS vs. RabbitMQ

Recently at work we’ve been working on selecting a message queue system for a new project we’re working on. The three that we looked at were ActiveMQ, Amazon SQS and RabbitMQ. We’ll be running the applications on Amazon’s EC2, so SQS seemed like just the thing. Unfortunately, it performs so poorly that it wasn’t even a contest in the end.

We threw out ActiveMQ pretty early before even running it through the performance tests due to the horrible clustering support. It seems to be geared more toward load balancing instead of high availability and fast failure recovery. It’s described here. We really didn’t think it met our HA requirements.

So I set out to do a performance test between Amazon SQS and RabbitMQ to see if they were close. I was really hoping that SQS would work since the setup and maintenance time would be next to nothing. We setup two EC2 small instances and installed RabbitMQ on one. The other would be used to run the Java program that ran the tests.

The idea was that when the program started up it would create 2 threads. One thread would publish 10,000 messages to the queue as fast as it could one after the other. The other thread would consume the 10,000 message from the queue one after the other as fast as it could. The results were astonishing. Running against SQS took a little over 6 minutes and RabbitMQ took 12 seconds. I suspected that RabbitMQ would be faster, but not that much faster. Our decision was easy.

I realize this isn’t a scientific benchmark, but the vast difference in results speaks for itself and a carefully constructed experiment isn’t needed. I uploaded the code I used to Github: https://github.com/dkincaid/MQBenchmark in case you’re interested in trying it out. Please understand this was a very quick program thrown together quickly for a single purpose. I fully expected to throw it away when I was done.

Strangeloop 2011 Day 2

I’m headed back home from Strangeloop 2011 this morning. Once again I booked an early flight so was up at 4:45 to get to the airport (when will I learn?) The conference was a smashing success as far as I am concerned. It was extremely well run and the talks were full of solid content. I didn’t see nearly as much marketing during the conference as I’ve seen at other conferences which was really nice. Most of the marketing I did see was companies trying to recruit new developers. There seems to be a lot of demand out there right now for innovative thinkers and people who are eager to stay on the cutting edge. Makes me think…

I started the day with a talk by Jake Luciani called “Hadoop and Cassandra”. Basically this was an introduction to a tool called Brisk which helps take some of the pain out of bringing up Hadoop clusters and running MapReduce jobs. In essence it embeds the components of Hadoop inside Cassandra and makes it easy to deploy and easy to scale with no downtime. It replaces HDFS with CassandraFS which in an of itself looks really interesting. It’s turning the Cassandra DB into a distributed file system. Very interesting how they are doing that. Sounds like a topic for another post once I’ve had some time to read some more about it. Jake showed a demo that looked quite impressive as he brought up a four cluster Hadoop on Cassandra node and ran a portfolio manager application splitting it into an OLTP side and an OLAP side. Brisk definitely deserves further investigation.

Continue reading

Strangeloop 2011 – Day 1 Debrief

I’m at Strangeloop 2011 in St. Louis. The workshops were yesterday, so I’m calling that day 0. I don’t really have much to say about the workshops other than I should have chosen different ones. I chose the two Clojure workshops to try to learn more about that language. I’ve been working with Clojure on and off for the last 6 months, but don’t feel I’ve really grasped the fundamentals of the language or how to think in it. While I did get a few new things from Stuart Sierra’s part 1 (Introduction to Clojure), I probably would have gotten more out of a different workshop. Stuart did a fantastic job presenting and it was an intro workshop, so it is completely my fault.

The second workshop was Aaron Bedra’s “Building Analytics with Clojure”. This wasn’t really about analytics at all unless your idea of analytics is making scatterplots and bar charts from a data set. I was expecting to learn much more about Incanter and how it can be used in similar ways to R. I must have misunderstood the topic of the workshop. I should have gone to the Cascalog workshop.

Today was much, much better. I came up to the room after lunch for a bit and was thinking that I had already gotten my money’s worth out of the first half day. I went to some amazing talks.

I haven’t been to many developer/tech conferences, so I don’t really have much to compare this to. I was at O’Reilly’s Strataconf in February and was a bit disappointed in the amount of actual content contained in most of the talks. The keynotes there were 15 minutes and most were sales pitches for the various sponsors. The talks here are nothing but great content. The team did a fantastic job lining up a great set of talks and I’m learning a ton.

Continue reading

Getting data from the Infochimps Geo API in R

I am very intrigued by the Infochimps Geo API, so wanted to play around with it a little bit and pull the data into R. I’ll start by getting data from the American Community Survey Topline API for a 10km area around where I live.

First some setup code here. It imports a couple libraries that we’ll need (RJSONIO and ggplot2) then sets up some variables that we’ll use later to construct the REST call into Infochimps Geo.

library(RJSONIO)
library(ggplot2)

api.uri <- "http://api.infochimps.com/"
acs.topline <- "social/demographics/us_census/topline/search?"
api.key <- "apikey=xxxxxxxxxx"

radius <- 10000  # in meters
lat <- 44.768202
long <- -91.491603

columns <- c("geography_name","median_household_income",
"median_housing_value", "avg_household_size") 

Note: if you want to use this code you’ll need to remove the x’s in the api.key and replace it with your Infochimps API key.

Continue reading