Avoiding data processing with hashes

Recently for a project I am working on we had to find a way to decide if a data record had changed since the last time we saved it to the database. For this project we are receiving data from a number of customer databases periodically (every 6 hours) and need to do some ETL stuff on it before storing it in an Oracle database. We all know that reading a record from Oracle can be slow (relatively speaking), so we wanted something much faster.

Since there isn’t a reasonable way to determine from the databases themselves which records have changed we are forced to dump entire tables for processing. We wanted to do the minimal amount of processing necessary if the records haven’t changed since we wrote them to the database to save processing time. Typically for most of these tables only a very small number of rows have actually changed.

The solution we came up with was to calculate a hash value from the concatenated column values for each record when we write a record to Oracle. These hash values are then stored in a Redis hash keyed by the primary key of the table. Then when we receive a data table dump from the remote database we calculate the same hash on the incoming record and compare it to the hash value we have stored in Redis. If they match we know the record hasn’t changed and we don’t need to process it.

The whole thing is very fast and saves a great deal of unnecessary processing downstream. So far Redis has proven up to the task of quickly storing and retrieving millions of key/hash values.

Here’s what it looks like:

Let’s say we have a customer record in our database that has id, name, address, city and state like this:

id: 1000, name: Joe Smith, address: 123 Main St., city: New York, state: NY

the first time we see this record and check Redis for the existence of the key 1000 we will get nil returned, so we store the record in the database, calculate the hash of the value "1000:Joe Smith:123 Main St.:New York:NY" to get something like "0a243fec5ab" (we are using 128-bit hash values in production) and then store that hash value in Redis using

set "1000" "0a243fec5ab"

the next time we get this record we calculate the same hash if it is unchanged and know we don’t need to process it again. If the hash value is different then we know that the record has changed and we need to send it through the rest of the processing chain.

We are currently using Storm to process millions of records at a time and use this hash check very early on in the pipeline to save us from needing more resources later in the pipeline to process unchanged records. We chose to use the Murmur-128 hash that is included in the Google Guava library for its speed.

Have you ever had a similar requirement? How did you solve it?

Strangeloop 2011 Day 2

I’m headed back home from Strangeloop 2011 this morning. Once again I booked an early flight so was up at 4:45 to get to the airport (when will I learn?) The conference was a smashing success as far as I am concerned. It was extremely well run and the talks were full of solid content. I didn’t see nearly as much marketing during the conference as I’ve seen at other conferences which was really nice. Most of the marketing I did see was companies trying to recruit new developers. There seems to be a lot of demand out there right now for innovative thinkers and people who are eager to stay on the cutting edge. Makes me think…

I started the day with a talk by Jake Luciani called “Hadoop and Cassandra”. Basically this was an introduction to a tool called Brisk which helps take some of the pain out of bringing up Hadoop clusters and running MapReduce jobs. In essence it embeds the components of Hadoop inside Cassandra and makes it easy to deploy and easy to scale with no downtime. It replaces HDFS with CassandraFS which in an of itself looks really interesting. It’s turning the Cassandra DB into a distributed file system. Very interesting how they are doing that. Sounds like a topic for another post once I’ve had some time to read some more about it. Jake showed a demo that looked quite impressive as he brought up a four cluster Hadoop on Cassandra node and ran a portfolio manager application splitting it into an OLTP side and an OLAP side. Brisk definitely deserves further investigation.

Continue reading