“Big Data” has become what dot-com was in the late nineties. Anyone with a cute little elephant as their mascot is raking in “big bucks”. Yes, Hadoop is the buzzword around town.
At a high level, Hadoop lets you distribute data across multiple nodes so that you can process it in parallel. And that leads us to the other buzzword (or two) – MapReduce. It seems intuitive that, because of the distribution, you can quickly “look” for data in parallel, and once you find it, combine it into something useful. The first part is Map; the second part is Reduce.
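For the uninitiated, here is a toy sketch of the idea in Scala (plain collection operations standing in for a real Hadoop job; the data is made up):

```scala
// Toy illustration of the Map/Reduce idea (illustrative only, not
// Hadoop code): Map extracts something from every record, a step that
// parallelizes well; Reduce combines the results into one answer.
object MapReduceToy extends App {
  val logLines = Vector(
    "2014-01-01 ERROR disk full",
    "2014-01-01 INFO  heartbeat ok",
    "2014-01-02 ERROR disk full"
  )

  // Map: "look" for ERROR lines, emitting a 1 for each hit.
  val mapped = logLines.map(line => if (line.contains("ERROR")) 1 else 0)

  // Reduce: combine all the partial results (the linear part).
  val errorCount = mapped.reduce(_ + _)

  println(s"errors = $errorCount") // prints: errors = 2
}
```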
It also seems intuitive that no matter how much you distribute Map, Reduce has to be a linear step. In fact, MapReduce, if used incorrectly, can be a very slow process that is made fast only by throwing a bunch of nodes at the data in parallel. Now, applying this philosophy to machine log analytics applications, I have two problems with it:
1. The Map step essentially looks for data based on some rules (regexes and such). Even distributed, it is an expensive operation. For a given analytic query, why repeat the same expensive operation over and over again? It makes sense to preserve the rules for finding the data, and to persist the data already parsed (see the sketch after this list).
2. Why have a slow step in the overall architecture at all? Wouldn’t it be nice to Map but not Reduce?
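A minimal sketch of problem 1, with made-up names: run the expensive regex Map once, keep the structured result, and let every later query hit the parsed data instead of re-scanning the raw text:

```scala
// Sketch of "parse once, query many times" (illustrative names, not
// anyone's actual code). The regex is the "rule" for looking for data.
object ParseOnce extends App {
  val rawLog = Seq(
    "2014-01-02 ERROR disk full",
    "2014-01-02 WARN  fan speed low"
  )
  val Rule = """(\S+) (\S+)\s+(.*)""".r

  // Expensive step, done exactly once: raw lines -> structured fields.
  val parsed: Seq[(String, String, String)] = rawLog.collect {
    case Rule(date, level, msg) => (date, level, msg)
  }

  // Subsequent queries hit the parsed data, not the raw text.
  val errors = parsed.filter { case (_, level, _) => level == "ERROR" }
  println(errors) // List((2014-01-02,ERROR,disk full))
}
```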
The solution lies in having a Domain Specific Language (DSL) for defining these rules easily. Not only that, make it stricter (more like a real language) than mere configuration parameters, so that even data accuracy assertions can be inherently built into it.
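To be clear, what follows is a hypothetical flavour of such a DSL embedded in Scala, not Glassbeam’s actual syntax; the point is that each rule carries an accuracy assertion alongside its pattern:

```scala
// Hypothetical rule definition (NOT the actual DSL): a rule couples the
// pattern that finds the data with an assertion that the extracted
// value is plausible, so bad parses are rejected up front.
object DslSketch extends App {
  case class Rule(name: String,
                  pattern: scala.util.matching.Regex,
                  assertion: Seq[String] => Boolean)

  val cpuTemp = Rule(
    name      = "cpu_temp",
    pattern   = """temp=(\d+)C""".r,
    // Accuracy assertion built into the rule: a CPU temperature
    // outside 0-150C is treated as a bad parse, not stored.
    assertion = fields => { val t = fields.head.toInt; t >= 0 && t <= 150 }
  )

  // Apply a rule: extract the fields, then enforce the assertion.
  def applyRule(rule: Rule, line: String): Option[Seq[String]] =
    rule.pattern.findFirstMatchIn(line).map(_.subgroups).filter(rule.assertion)

  println(applyRule(cpuTemp, "sensor01 temp=72C"))  // Some(List(72))
  println(applyRule(cpuTemp, "sensor01 temp=999C")) // None: fails assertion
}
```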
Use something like the Akka actor framework, which distributes work seamlessly across nodes. If this sounds very much like Map – it is.
Make this an asynchronous peer-to-peer framework, with no master/slave relationship. Empower the actors to do their job without having to “report” back. Guess what: you just eliminated the linear and time-consuming part of MapReduce. You have Map and NO REDUCE. The second benefit of “no Reduce” is that this framework can now scale to a very large number of nodes, limited only by the physical limits of cluster sizes and interconnects.
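Here is a minimal Akka sketch of that “Map with no Reduce” shape (actor and message names are invented for illustration): each parser actor handles the lines it is handed and never replies, so there is no coordinator waiting to combine results.

```scala
// Fire-and-forget actors: "!" (tell) is asynchronous and expects no
// answer, so parsing fans out like Map with nothing flowing back to
// be reduced. Names here are illustrative only.
import akka.actor.{Actor, ActorSystem, Props}

case class ParseLine(line: String)

class ParserActor extends Actor {
  def receive = {
    case ParseLine(line) =>
      val fields = line.split("\\s+") // stand-in for real rule matching
      // In the real design the actor would deposit `fields` straight
      // into the data store here; it never sends a result back.
      println(s"${self.path.name} parsed: ${fields.mkString("|")}")
  }
}

object NoReduce extends App {
  val system  = ActorSystem("logs")
  val parsers = (1 to 4).map(i => system.actorOf(Props[ParserActor](), s"parser-$i"))

  val lines = Seq("a b c", "d e f", "g h i", "j k l")
  // Round-robin the lines across actors; no reply is ever awaited.
  lines.zipWithIndex.foreach { case (l, i) => parsers(i % 4) ! ParseLine(l) }

  Thread.sleep(500) // crude: let the messages drain before shutdown
  system.terminate()
}
```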
The final piece of the puzzle is persisting the data already parsed. Use a data store like Cassandra, which has no master/slave relationship, and have the actors deposit the parsed data directly into it. Since Cassandra is a peer-to-peer cluster, asynchronous actors can deposit data asynchronously to whichever nodes are visible to them.
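A sketch of that deposit step, assuming the DataStax Java driver (3.x) and an invented keyspace and table; executeAsync keeps the write non-blocking, in keeping with the asynchronous actor design:

```scala
// Depositing parsed fields straight into Cassandra (DataStax Java
// driver 3.x style; keyspace "logs" and table "parsed_events" are
// invented for this sketch).
import com.datastax.driver.core.Cluster

object Deposit extends App {
  val cluster = Cluster.builder()
    .addContactPoint("127.0.0.1") // any node will do: no master in Cassandra
    .build()
  val session = cluster.connect("logs")

  val insert = session.prepare(
    "INSERT INTO parsed_events (day, ts, level, msg) VALUES (?, ?, ?, ?)")

  // Asynchronous, fire-and-forget from the actor's point of view:
  // hand the row to the driver and move on to the next log line.
  session.executeAsync(
    insert.bind("2014-01-02", new java.util.Date(), "ERROR", "disk full"))

  cluster.close() // graceful: waits for in-flight requests to finish
}
```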
The Glassbeam solution achieves all three: a robust DSL, a highly scalable actor framework, and a Cassandra-based data store that holds pre-parsed data as well as raw data for subsequent incremental processing.
Stay tuned for more on incremental processing…