“Big Data” has become what the dot-com was in the late nineties. Anyone with a cute little elephant as their mascot is getting “big bucks”. Yes, Hadoop is the buzzword around town.
At a high level, Hadoop lets you distribute data across multiple nodes so that you can process it in parallel. And that leads us to the other buzzword (or two): MapReduce. It seems intuitive that, because the data is distributed, you can quickly “look” for data in parallel and, once you find it, combine it to do something useful. The first part is Map and the second part is Reduce.
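To make the two halves concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code: the log lines, the `LEVEL` pattern, and the function names are all made up for illustration, and a thread pool stands in for the nodes of a cluster.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical log lines, split into chunks as if each chunk lived on its own node.
CHUNKS = [
    ["INFO boot ok", "ERROR disk full", "WARN fan slow"],
    ["ERROR disk full", "INFO login", "ERROR net down"],
]

# Assumed rule: a line starts with its log level.
LEVEL = re.compile(r"^(INFO|WARN|ERROR)\b")


def map_chunk(lines):
    """Map: scan one chunk and count the log levels found in it."""
    counts = Counter()
    for line in lines:
        match = LEVEL.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def reduce_counts(left, right):
    """Reduce: merge two partial counts into one."""
    return left + right


def map_reduce(chunks):
    # The map step runs once per chunk, in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_chunk, chunks))
    # The reduce step folds the partial results together, one after another.
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    print(map_reduce(CHUNKS))
```

Note where the parallelism is: `map_chunk` runs once per chunk side by side, while `map_reduce` has to wait for every partial result before it can fold them together.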
It also seems intuitive that no matter how much you distribute Map, Reduce has to be a linear process. In fact, MapReduce, if used incorrectly, can be a very slow process that is made fast only by throwing a bunch of nodes at the data in parallel. Now, when it comes to applying this philosophy to machine log analytics applications, I have two problems with it:
1. The Map process essentially looks for data based on some rules (regexes or the like). Even when distributed, it is an expensive operation. For a given analytic query, why repeat the same expensive operation over and over again? It makes sense to preserve the rules for finding the data and to persist the data that has already been found.
2. Why have a slow process in the overall architecture at all? Wouldn’t it be nice to Map but not Reduce?
The solution starts with a Domain Specific Language (DSL) that makes the rules easy to define. Beyond that, make it stricter (more like a language) than mere configuration parameters, so that even data accuracy assertions can be built into it.
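As a rough illustration of what "rules plus built-in assertions" could look like, here is a tiny internal DSL sketched in Python. This is not Glassbeam's DSL: the `rule` and `ensure` names, the fan-speed log format, and the 0 to 20000 rpm bound are all assumptions invented for the example.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Rule:
    """A named extraction rule with optional data accuracy assertions."""
    name: str
    pattern: re.Pattern
    checks: List[Callable[[Dict[str, str]], bool]] = field(default_factory=list)

    def ensure(self, check: Callable[[Dict[str, str]], bool]) -> "Rule":
        # Attach an assertion and return self so that calls can be chained.
        self.checks.append(check)
        return self

    def apply(self, line: str) -> Optional[Dict[str, str]]:
        # Return the named groups if the line matches, or None if it does not.
        match = self.pattern.search(line)
        if match is None:
            return None
        row = match.groupdict()
        for check in self.checks:
            if not check(row):
                raise ValueError(f"rule {self.name!r}: assertion failed for {row}")
        return row


def rule(name: str, pattern: str) -> Rule:
    """Entry point of the hypothetical DSL."""
    return Rule(name, re.compile(pattern))


# A rule definition reads almost like a declaration: what to find, and what must hold.
fan_speed = rule("fan_speed", r"fan(?P<id>\d+) rpm=(?P<rpm>\d+)").ensure(
    lambda row: 0 <= int(row["rpm"]) <= 20000
)

if __name__ == "__main__":
    print(fan_speed.apply("fan3 rpm=4200"))
```

Because the assertion lives in the rule itself, a line that matches but carries an implausible value raises an error at parse time instead of quietly ending up in the results.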
Use an actor framework such as Akka, which distributes seamlessly across nodes. If this sounds very much like Map, it is.
Make this an asynchronous peer-to-peer framework, with no master/slave relationship. Empower the actors to do their job without having to “report” back. Guess what, you just eliminated the linear and time-consuming part of MapReduce. You have Map and NO REDUCE. The second benefit of “No Reduce” is that this framework can now scale to a high number of nodes, limited only by the physical limits of cluster sizes and interconnects.
The final piece of the puzzle is to persist the data that has already been found. Use a data store like Cassandra, which has no master/slave relationship, and have the actors deposit the “looked” data directly into Cassandra. Since Cassandra is a peer-to-peer cluster, the actors can deposit data asynchronously to whichever nodes are visible to them.
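The "map, deposit, and don't report back" idea can be sketched with plain Python `asyncio`. This is only an analogy: coroutines stand in for actors, and the `InMemoryStore` class is a made-up placeholder for a Cassandra cluster, not a real driver. The `LINE` pattern and all names are assumptions for the example.

```python
import asyncio
import re

# Assumed rule: a log level followed by a free-form message.
LINE = re.compile(r"^(?P<level>INFO|WARN|ERROR) (?P<msg>.+)$")


class InMemoryStore:
    """Hypothetical stand-in for a peer-to-peer data store (not a real driver)."""

    def __init__(self):
        self.rows = []

    async def insert(self, row):
        await asyncio.sleep(0)  # yield control, mimicking a non-blocking write
        self.rows.append(row)


async def worker(name, lines, store):
    """Parse lines and deposit each parsed row directly into the store.

    Nothing is returned to a coordinator, so there is no combine step.
    """
    for line in lines:
        match = LINE.match(line)
        if match:
            await store.insert({"worker": name, **match.groupdict()})


async def deposit_all(chunks, store):
    # gather() only waits for the workers to finish; it collects no results.
    await asyncio.gather(
        *(worker(f"w{i}", chunk, store) for i, chunk in enumerate(chunks))
    )


def run_workers(chunks):
    store = InMemoryStore()
    asyncio.run(deposit_all(chunks, store))
    return store


if __name__ == "__main__":
    demo = run_workers([["ERROR disk full", "garbage"], ["INFO login"]])
    print(demo.rows)
```

The only coordination left is `gather()` waiting for the demo to finish. The parsed rows never flow back through a single combining step, and a later query would read the already-parsed rows from the store instead of scanning the raw lines again.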
The Glassbeam solution achieves all three. It provides a robust DSL, a highly scalable actor framework, and a Cassandra-based data store that contains pre-parsed data as well as raw data for subsequent incremental processing.
Stay tuned for more on incremental processing...