Hadoop Disillusionment

It’s been quite a while since I last blogged. I’d like to say I’ll get more consistent, but things are so busy it’s hard to find the time. I felt the need to write something today both to “get it off my chest,” as they say, and to maybe help others who are just starting with Hadoop avoid some misunderstandings.

The title of this post refers to my own disillusionment, not to the cliché “trough of disillusionment” of Gartner, et al. I’ve been on the fringes of the Hadoop world for several years. I attended the first O’Reilly Strata Conference back in 2011 and I’ve read and read and read blog posts, watched many talks and tutorials, etc. I have even been working a bit with a production job that runs weekly on Amazon’s EMR service. But I’ve never really had to do a full-scale project that relied on Hadoop as its foundation. So I’ve developed some misunderstandings about how things work, and I’m having those bubbles popped as I launch a Hadoop project.

Recently I pitched an idea for a project that I’ve been designing in my head for a couple years and got it approved and funded. So I finally had my big data project that I could sink my teeth into and do a full fledged Hadoop implementation. Up until this point EMR and Cascalog/Cascading had insulated me from the plumbing and details of Hadoop itself. I’m working with a really sharp team of four. We’re all Hadoop newbies, so we’re all climbing that learning curve together.

The old saying “be careful what you wish for” has hit me square in the face. I tweeted out some comments over the last week or so that were probably pretty unfair to MapR in particular. I’ve come to see that the shortcomings I was complaining about are shortcomings in the Hadoop platform itself. It’s not something that MapR has done. I’ve come to see that they have added a number of simplifications and created sane defaults where Hadoop itself has missed the mark. In fact, MapR has reached out to us and is actively helping us get things working.

So, since 140 characters at a time isn’t enough to get my point across and has caused more misunderstanding than it has cleared up, I decided to spell things out in a longer form and maybe even add my voice to others. It’s going to take a lot of us to get the Hadoop ship turned.

Over the next few blog posts I intend to take you on the journey as I go from Hadoop neophyte to disillusioned newbie. Along the way I welcome corrections where I may be wrong or off the mark. Watch for the first post within the next day on my top five current complaints:

  1. Out of control configuration options (aka XML sucks)
  2. Inability to do development in a Windows environment (unfortunately, not everyone is on Linux or Mac yet)
  3. Reliance on shell scripts for everything (we’re writing Java apps, not Bash scripts)
  4. Out of date and incomplete documentation (what’s out there is all the same and misses some crucial things)
  5. A really, really nasty looking code base (100+ line methods, shell out to OS, oh my God!)

There, that should be provocative enough to get you to come back… Some other things related to this project that I may or may not blog about are why I think AWS EMR is not a serious platform for more than a one-off job here and there and our experiences implementing ideas from Nathan Marz’ incomplete book “Big Data”.


Cascalog Performance Tuning (or avoiding Java reflection at all costs)

Recently I’ve been working on a project to both learn and do a proof of concept using Hadoop for some data processing. The most I’ve done with Hadoop until this was a couple tutorials using word count examples. This project called for much more processing and the prospect of trying to figure out how to do that in straight MapReduce was daunting.

I was aware of Cascading, and since all of our work is done in Java these days I decided to give that a try. So the first prototype was developed using Cascading and proved fairly successful. The code was kind of a mess, though, as Java code of any size tends to be. Since I have done a little bit of programming in Clojure over the last year or so, I knew of Cascalog and decided to give it a try to see how it would compare with just doing it ourselves in Cascading.

Let me summarize the project first then I’ll share what I learned in the process.

The project is to take data in the form of pipe delimited text files that come in every night from about 2,000 individual stores, compute some aggregated stats and populate an Oracle database table with the results. Each line in the text files is an individual item sold at the store and the files contain all sales at the store for the last seven years or so. The Hadoop job needs to run over all the transactions in all of the files to calculate the totals. (I realize there are better ways to do this by only getting and processing updated data each day, but this is more of a learning project and evaluation of Hadoop, Cascading and Cascalog). All in all there are about 480 million lines in all of the files.
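To make the shape of the job concrete, here is a minimal single-machine sketch in Python of the aggregation described above. It is not the Cascading or Cascalog code from the project. The record layout (`store_id|item_id|sale_date|amount`), the sample values, and the function name `aggregate_sales` are all assumptions made up for illustration, since the real file format isn't spelled out here. On Hadoop the same idea would be split into a map step that emits a (store, amount) pair per line and a reduce step that sums the amounts for each store.

```python
from collections import defaultdict
from decimal import Decimal


def aggregate_sales(lines):
    """Sum the sale amount per store from pipe-delimited lines.

    Assumed (hypothetical) layout: store_id|item_id|sale_date|amount
    """
    totals = defaultdict(Decimal)  # a missing store starts at Decimal("0")
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        store_id, _item_id, _sale_date, amount = line.split("|")
        totals[store_id] += Decimal(amount)
    return dict(totals)


# Made-up sample records, one sold item per line.
sample = [
    "0001|SKU-1|2011-09-01|4.50",
    "0001|SKU-2|2011-09-01|1.25",
    "0002|SKU-1|2011-09-02|4.50",
]
totals = aggregate_sales(sample)
```

`Decimal` is used rather than `float` so that currency totals don't pick up binary rounding error. That choice is mine for the sketch, not something taken from the project.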


Strangeloop 2011 Day 2

I’m headed back home from Strangeloop 2011 this morning. Once again I booked an early flight, so I was up at 4:45 to get to the airport (when will I learn?). The conference was a smashing success as far as I am concerned. It was extremely well run and the talks were full of solid content. I didn’t see nearly as much marketing during the conference as I’ve seen at other conferences, which was really nice. Most of the marketing I did see was companies trying to recruit new developers. There seems to be a lot of demand out there right now for innovative thinkers and people who are eager to stay on the cutting edge. Makes me think…

I started the day with a talk by Jake Luciani called “Hadoop and Cassandra”. Basically this was an introduction to a tool called Brisk, which helps take some of the pain out of bringing up Hadoop clusters and running MapReduce jobs. In essence it embeds the components of Hadoop inside Cassandra and makes it easy to deploy and easy to scale with no downtime. It replaces HDFS with CassandraFS, which in and of itself looks really interesting: it turns the Cassandra DB into a distributed file system. How they are doing that sounds like a topic for another post once I’ve had some time to read more about it. Jake showed a demo that looked quite impressive as he brought up a four-node Hadoop on Cassandra cluster and ran a portfolio manager application, splitting it into an OLTP side and an OLAP side. Brisk definitely deserves further investigation.
