Thursday, May 23, 2013

To Search or Not to Search

Search is a great starting point for any analytics on machine data. Initially, as a user, you don't know exactly what you are looking for, so searching for the "needle in a haystack" is easy: all you need to do is type needle! Yes, you will get a lot of results back, which then need to be filtered, ranked and presented in a meaningful way, but open source search engines that allow full-text search of any document, such as Solr/Lucene, provide a good starting point for a search implementation. Once the data is indexed, a lot of functionality can be built on top of it to enhance the user's ability to search.

While search is good as a starting point, it tends to get more geeky for the end user when complex search queries need to be created. Even if you manage to write complex queries, driving everything from search alone puts a huge load on the search server. Also, the power of full-text search diminishes when you want to perform the kind of operations normally done on structured data.

Consider a simple example: a device log in which CPU load information is captured as a table by the Linux vmstat command, as shown below. This is just one section among a whole lot of other information in the log.

r  b    swpd    free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
8  1 1865144 1018628  56364 1677296   10   16    35   161   97   92  4  0  2  1  0
2  2 1865144 1018628  56364 1677296    0    0     0     4 1007  395  0  0 30  0  0
4  2 1865144 1018628  56376 1677296    0    0     0    92 1016  421  0  0 10  0  0
4  2 1865144 1018628  56376 1677296    0    0     0     0 1008  363  0  0 10  0  0
6  3 1865144 1018628  56376 1677316    0    0     0     0 1003  361  0  0  5  0  0
3  1 1865144 1018628  56384 1677308    0    0     0    20 1006  341  0  0 10  0  0
1  1 1865144 1018628  56384 1677332    0    0     0     8 1009  364  0  0 50  0  0
0  0 1865144 1018628  56396 1677320    0    0     0    92 1019  448  0  0 99  0  0

Let's say we need to plot r and b over time, and also look at the CPU idle percentage (id) whenever r or b is greater than 2.

A requirement like this is complex, if not impossible, to meet through search alone. Now assume that the data above has been parsed and is available in a database-like table. Performing the analysis described above becomes relatively simple. The user can also be given a simple interface for browsing the extracted attributes, with drag-and-drop controls to slice and dice the data. This approach does, however, require upfront data extraction, a data store to persist the results, and a simple framework to do it all efficiently.
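As a rough illustration of why the parsed form makes this easy, here is a minimal Python sketch. It assumes the vmstat section has already been cut out of the log as plain text; the names `VMSTAT_SECTION`, `parse_vmstat` and `busy` are made up for this example, and the sample index stands in for time because the excerpt carries no timestamps.

```python
# Sketch only: parse the vmstat section into rows, then query it like a table.
VMSTAT_SECTION = """\
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
8  1 1865144 1018628  56364 1677296   10   16    35   161   97   92  4  0 2  1  0
2  2 1865144 1018628  56364 1677296    0    0     0     4 1007  395  0  0 30  0  0
4  2 1865144 1018628  56376 1677296    0    0     0    92 1016  421  0  0 10  0  0
4  2 1865144 1018628  56376 1677296    0    0     0     0 1008  363  0  0 10  0  0
6  3 1865144 1018628  56376 1677316    0    0     0     0 1003  361  0  0 5  0  0
3  1 1865144 1018628  56384 1677308    0    0     0    20 1006  341  0  0 10  0  0
1  1 1865144 1018628  56384 1677332    0    0     0     8 1009  364  0  0 50  0  0
0  0 1865144 1018628  56396 1677320    0    0     0    92 1019  448  0  0 99  0  0
"""


def parse_vmstat(section):
    """Turn the whitespace-separated table into a list of {column: int} rows."""
    lines = [line.split() for line in section.strip().splitlines()]
    header, data = lines[0], lines[1:]
    return [dict(zip(header, map(int, values))) for values in data]


rows = parse_vmstat(VMSTAT_SECTION)

# Series to plot against the sample index (a stand-in for time).
r_series = [row["r"] for row in rows]
b_series = [row["b"] for row in rows]

# Samples where r or b exceeds 2, together with the id column for each.
busy = [(i, row["r"], row["b"], row["id"])
        for i, row in enumerate(rows)
        if row["r"] > 2 or row["b"] > 2]

for sample in busy:
    print(sample)
```

Once the rows exist as structured records, the condition on r and b is a one-line filter, which is the point of parsing upfront.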

The question, then, is not whether to do analytics through search or through pre-parsed content. It is about using the right method for the type of content, so that effective results can be achieved with minimal cost, time and resources. So why not both? When the data is both parsed and indexed appropriately, it opens up avenues that are not possible with either approach alone. For example, the parsed content can add useful context to what is being searched, while search as an interface gets the user to the content in the simplest and quickest way. You are no longer searching for a pattern; you are searching for a pattern within a well-defined context.

For example, consider a search like this: "Error code 5085" where Software version = 5.4 or 5.6, Model = 9877, License = NFS, Language Setting = US-EN. Assume that each context variable passed to the search comes from some section of the log, and that the attribute in that section can have multiple values. Pre-parsed content can also be transformed or pre-aggregated to reduce the response time the end user sees for queries. A combined approach could therefore provide a lot more options than an either/or approach.
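To make the idea of "a pattern within a well-defined context" concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the `LogBundle` type, the attribute names and the sample bundles are invented, and a plain substring test stands in for what a real full-text index such as Solr/Lucene would do.

```python
from dataclasses import dataclass


@dataclass
class LogBundle:
    name: str
    context: dict  # attribute name -> set of values parsed from log sections
    text: str      # raw log text (a real system would query a full-text index)


def search(bundles, term, filters):
    """Return bundles whose text contains `term` and whose parsed context
    shares at least one value with every filter."""
    hits = []
    for bundle in bundles:
        if term.lower() not in bundle.text.lower():
            continue
        if all(bundle.context.get(attr, set()) & allowed
               for attr, allowed in filters.items()):
            hits.append(bundle)
    return hits


# Hypothetical sample data, not taken from any real log.
bundles = [
    LogBundle("bundle-1",
              {"software_version": {"5.4"}, "model": {"9877"},
               "license": {"NFS", "CIFS"}, "language": {"US-EN"}},
              "disk scan failed: Error code 5085"),
    LogBundle("bundle-2",
              {"software_version": {"5.2"}, "model": {"9877"},
               "license": {"NFS"}, "language": {"US-EN"}},
              "disk scan failed: Error code 5085"),
    LogBundle("bundle-3",
              {"software_version": {"5.6"}, "model": {"9877"},
               "license": {"NFS"}, "language": {"US-EN"}},
              "disk scan completed"),
]

hits = search(bundles, "Error code 5085", {
    "software_version": {"5.4", "5.6"},
    "model": {"9877"},
    "license": {"NFS"},
    "language": {"US-EN"},
})
print([bundle.name for bundle in hits])
```

Each context attribute is stored as a set because, as noted above, an attribute in a log section can have multiple values; a filter matches when the sets overlap.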

At Glassbeam, we are constantly looking to provide practical solutions to Big Data problems. The post above describes just one such use case.

Social Responsibility begins at home.... and office too!

Vishwanath works as a front desk/security supervisor at our Glassbeam India office. Those of us who work in India know that the salary he earns is barely enough to meet basic living needs. He has a daughter who finished her 10th standard and got her results a couple of days back. She cleared the exams with flying colours, scoring 88.5% overall: 98 in Math, 98 in Social Science, 94 in English, 100 in Kannada (out of 125), 86 in Hindi and 77 in General Science. While this is an amazing achievement on its own, the following factors make it even better.

1.   Her background - For those of us who come from a background of financial stability or well-educated parents, relatives and siblings, it is far less difficult to score well than it is for children who lack that basic support. Kids who do well in such circumstances deserve manifold applause for their achievement.
2.   She studied in Kannada medium till 8th standard - She moved to English medium only from 9th, which means that not only did she learn English in just 2 years (she scored 94 in English), but she also learnt all her other subjects in English and fared really well too. This is not a small achievement by any means. Those of us who studied in our mother tongue for most of our schooling and switched to English medium later know that this is indeed commendable.

She was planning to join one of the top colleges in Bangalore for her pre-university course. The college fees were Rs 30,000, plus other expenses for books and so on. Vishwanath had managed to arrange Rs 10,000 and didn't know what to do for the rest of the money. It was heart-warming to see how quickly the whole team contributed the remaining amount, as soon as they got to know about it, to help her get admitted. She joined the college of her choice and wants to study Engineering once she completes her 2nd PU (12th standard in Karnataka). She is really happy and excited about her college. Vishwanath is happy that the next generation of his family will be much better educated and more financially stable than he is.

Social responsibility doesn't always have to mean doing something outside. It can start right where you work. Social responsibility, in simple words, is about being sensitive to the issues around you and taking some action to solve or mitigate them. For those of us who have crossed the barrier and stabilized ourselves socially and economically, helping others around us cross is a simple action.

Glassbeam recognized by CRN and TiEcon

Last week was an eventful week for Glassbeam. We were recognized by the industry with two distinct awards. The first was our nomination to the CRN Big Data 100 list. The second was our selection at TiEcon 2013 as one of the winners of the Big Data Lightning Round. Srikanth Desikan, our fearless VP Products & Marketing, presented the Glassbeam story on machine data analytics in true lightning style, in less than 3 minutes! We will be posting the video of this presentation on our website soon.

Getting these awards is a significant validation for us as a company and, most importantly, great recognition of the hard work being put in by the team across the US and India time zones. There are some really interesting updates coming up on our customer wins, product roadmap and outbound events in the coming weeks. So stay tuned, and enjoy your long Memorial Day weekend!

Tuesday, May 7, 2013

To Reduce or not to Reduce

“Big Data” has become what dot-com used to be in the late nineties. Anyone with a cute little elephant as their mascot is getting “big bucks”. Yes, Hadoop is the buzzword around town.

At a high level, Hadoop allows you to distribute data across multiple nodes so that you can process it in parallel. And that leads us to the other buzzword (or two): MapReduce. It seems intuitive that, because of the distribution, you can quickly “look” for data in parallel and, once you find it, combine it to do something useful with it. The first part is Map and the second part is Reduce.
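The two phases can be sketched in a few lines of plain Python. This is not Hadoop code; it only mirrors the shape of the idea, with a hypothetical regex rule and made-up log chunks. Each chunk could be mapped on a different node, while the reduce step folds the partial results together one after another.

```python
import re
from collections import Counter
from functools import reduce

# Hypothetical rule: pull error codes out of raw log text.
ERROR_RE = re.compile(r"Error code (\d+)")


def map_chunk(chunk):
    """Map: look for data in one chunk, independently of all other chunks."""
    return Counter(ERROR_RE.findall(chunk))


def reduce_counts(left, right):
    """Reduce: combine two partial results into one."""
    return left + right


# Made-up chunks standing in for data spread across nodes.
chunks = [
    "boot ok\nError code 5085\nError code 17\n",
    "Error code 5085\nshutdown\n",
]

partials = [map_chunk(chunk) for chunk in chunks]    # parallelizable
total = reduce(reduce_counts, partials, Counter())   # sequential fold
print(total)
```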

It also seems intuitive that no matter how much you distribute Map, Reduce has to be a linear process. In fact, MapReduce, if used incorrectly, can be a very slow process that is made fast only by throwing a bunch of nodes at it to process the data in parallel. Applying this philosophy to machine log analytics applications, I have two problems with it:

1.      The Map process essentially looks for data based on some rules (regexes or the like). Even when it is distributed, it is an expensive operation. For a given analytic query, why repeat the same expensive operation over and over again? It makes sense to preserve the rules for finding the data, and to persist the data that has already been found.
2.      Why have a slow process in the overall architecture at all? Wouldn’t it be nice to Map but not Reduce?

The solution lies in having a Domain Specific Language (DSL) that makes the rules easy to define. Beyond that, make it stricter (more like a language) than a mere set of configuration parameters, so that even data accuracy assertions can be built into it.
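To give a feel for what "rules with built-in assertions" could look like, here is a toy sketch written as a small embedded rule language in Python. It is not the Glassbeam DSL; the `Field` and `RowRule` names and the three-column rule are invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Field:
    name: str
    convert: Callable[[str], object]  # how to turn the raw token into a value
    check: Callable[[object], bool]   # accuracy assertion for that value


@dataclass(frozen=True)
class RowRule:
    fields: Sequence[Field]

    def parse(self, line):
        """Parse one whitespace-separated line, enforcing every assertion."""
        tokens = line.split()
        if len(tokens) != len(self.fields):
            raise ValueError(
                f"expected {len(self.fields)} columns, got {len(tokens)}")
        row = {}
        for field, token in zip(self.fields, tokens):
            value = field.convert(token)
            if not field.check(value):
                raise ValueError(f"{field.name}={value!r} failed its assertion")
            row[field.name] = value
        return row


def non_negative(value):
    return value >= 0


def percent(value):
    return 0 <= value <= 100


# Hypothetical rule for a three-column slice of a CPU table.
cpu_rule = RowRule(fields=(
    Field("r", int, non_negative),
    Field("b", int, non_negative),
    Field("id", int, percent),
))

row = cpu_rule.parse("8 1 2")
print(row)
```

Because the checks live in the rule itself, a malformed line such as `"8 1 250"` is rejected at parse time with a `ValueError` instead of silently landing in the data store.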

Use something like the Akka actor framework, which distributes work seamlessly across nodes. If this sounds very much like Map, it is.

Make this an asynchronous peer-to-peer framework, with no master/slave relationship. Empower the actors to do their job without having to “report” back. Guess what: you just eliminated the linear, time-consuming part of MapReduce. You have Map and NO REDUCE. The second benefit of “No Reduce” is that this framework can now scale to a high number of nodes, limited only by the physical limits of cluster sizes and interconnects.

The final piece of the puzzle is to persist the data that has already been found. Use a data store like Cassandra, which has no master/slave relationship, and have the actors deposit the “looked” data directly into Cassandra. Since Cassandra is a peer-to-peer cluster, asynchronous actors can deposit data asynchronously to whichever nodes are visible to them.
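The "Map and no Reduce" flow can be sketched as follows. This is a single-process Python illustration under loud assumptions: a thread pool stands in for distributed actors (it is not Akka), the `Store` class is an in-memory stand-in for a peer-to-peer store (it is not Cassandra), and the rule and log contents are made up. What it does show is the shape of the idea: each worker applies the rule and writes what it finds straight to the store, and no results are returned to a coordinator for combining.

```python
import re
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical extraction rule.
RULE = re.compile(r"Error code (\d+)")


class Store:
    """In-memory stand-in for a data store that accepts independent writes."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def put(self, key, value):
        with self._lock:
            self._rows[key] = value

    def rows(self):
        with self._lock:
            return dict(self._rows)


def parse_and_store(store, source, text):
    """Apply the rule to one log and deposit each match directly.
    Nothing is returned, so there is nothing to reduce afterwards."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        match = RULE.search(line)
        if match:
            store.put((source, line_no), int(match.group(1)))


# Made-up logs standing in for files arriving from different devices.
logs = {
    "device-a.log": "boot ok\nError code 5085\n",
    "device-b.log": "Error code 17\nError code 5085\n",
}

store = Store()
with ThreadPoolExecutor(max_workers=2) as pool:
    for source, text in logs.items():
        pool.submit(parse_and_store, store, source, text)

print(store.rows())
```

The `with` block waits for the workers only so that this demo script can print the result; the workers themselves never wait on each other, because every write is keyed by its own (source, line number) pair.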

The Glassbeam solution achieves all three. It provides a robust DSL, a highly scalable actor framework and a Cassandra-based data store which contains pre-parsed data as well as raw data for subsequent incremental processing.

Stay tuned for more on incremental processing …..