Tuesday, July 16, 2013

Troubleshooting Troubles! - Part 2

Basic troubleshooting and Automation

In the previous blog, we looked at some of the steps commonly followed during troubleshooting and saw that, even though the specifics differ, the overall approach is similar.

While looking for solutions to enable support, we need to remember that not all support problems can be treated the same way. Support issues are typically broken down into different levels, depending on the complexity of the problem. Most organizations have three or four levels: L1 through L3 or L4.

  • Issues that get resolved at the L1/L2 level - These form the majority of the support cases in almost all support organizations. While these issues have a lower mean time to resolution (MTTR), their sheer volume means they take up most of the support team's time.
  • Issues that get resolved at the L2/L3 or L3/L4 levels (depending on how organizations have structured their teams) - While these are fewer in number, the MTTR for such problems is a lot higher.

The two types of issues above need to be handled differently. But before we discuss ways of solving this, one of the most important things to remember is that the support organization is always under tremendous pressure to resolve issues in the shortest time possible. A support engineer will always choose the method that gets results in the shortest time and the fewest possible steps, and hence a solution that does not meet those goals will not succeed.
With the above premise, let's look at possible solutions to the two types of issues listed above. In this blog we will look at L1/L2 issues and continue with the rest in the next blog. For L1/L2 types of issues, what if there were:
  • A system that can process incoming device logs within seconds
  • A log vault that allows the support engineer to get access to the log in the shortest possible time
  • A search interface to quickly and easily search historical logs
  • An integrated file viewer and file diff tool that allows a support engineer to view a log file of interest and see how it has changed compared to previous files
  • A rules and alert engine that can be configured to look for the most common problems (the known issue repository) and alert the support engineer when issues are found
  • An interface that highlights all the known issues in a given log as soon as it is loaded into the system, using the known issue repository
  • An interface to show related cases or known bugs - This would require integration with the case and bug management systems
  • An interface that shows the current configuration of the system and changes in configuration of interest - This requires an interface to define what constitutes a configuration and which parts of it need to be tracked for change
  • Integration with case management system and RMA system, to automatically open cases for known issues
While the above list is neither comprehensive nor detailed, it does provide an overview of the interfaces, systems and processes that can be put in place to make things easier for L1/L2 support.
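
To make the rules and alert engine idea from the list above a little more concrete, here is a minimal Python sketch of scanning an incoming log against a known issue repository. It is only an illustration: the issue IDs, patterns, advice text and sample log lines are all invented, and a real repository would live in a database rather than in a list.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownIssue:
    issue_id: str  # hypothetical identifier in the known issue repository
    pattern: str   # regular expression that signals the issue
    advice: str    # suggested next step for the support engineer


# Hypothetical known issue repository (made-up entries).
KNOWN_ISSUES = [
    KnownIssue("KI-001", r"disk .* failed", "Open an RMA case for the disk"),
    KnownIssue("KI-002", r"license expired", "Ask the customer to renew the license"),
]


def find_known_issues(log_lines, issues=KNOWN_ISSUES):
    """Return (line_number, issue) pairs for every line matching a known issue."""
    compiled = [(issue, re.compile(issue.pattern, re.IGNORECASE)) for issue in issues]
    alerts = []
    for number, line in enumerate(log_lines, start=1):
        for issue, regex in compiled:
            if regex.search(line):
                alerts.append((number, issue))
    return alerts


# Made-up log lines for illustration.
sample_log = [
    "2013-07-16 10:00:01 INFO system boot complete",
    "2013-07-16 10:05:12 ERROR Disk sda2 failed self test",
    "2013-07-16 10:06:40 WARN License expired for feature NFS",
]

alerts = find_known_issues(sample_log)
for number, issue in alerts:
    print(f"line {number}: {issue.issue_id} - {issue.advice}")
```

The same list of alerts could feed the "highlight known issues" interface or, with integration into a case management system, be used to open cases automatically.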

In the next blog, we will continue this topic and look at systems and processes that can help with L3/L4 types of issues.

Wednesday, June 19, 2013

Troubleshooting troubles! - Part 1

Troubleshooting troubles!

This is a four-part series focusing on the use case and possible solutions for supporting support engineers.

The problem statement

Having done tech support in the past, and now talking to clients and prospects about supporting their support teams, what always sticks out is how different the support process is from one company to another. The products being supported are different, the logs for each device are different, what you look for in the logs is different, and in general everything differs from one product's support team to another. Moreover, how one troubleshoots a problem also varies from one person to another, even within the same product. This raises an interesting question in the machine-data-based support space: is support automation even a possibility?

Support automation as one pre-defined workflow tool that works for all support groups ranges from complex to impractical. While some aspects of troubleshooting can be standardized, most of it will be product-specific or individualistic. What support needs are tools that make troubleshooting tasks simpler and faster, tools that can be customized to every individual's needs, and tools that can be programmed to bubble up all known issues and automate related support processes.

If one takes a step back and looks at troubleshooting as a process across product lines, many common things stand out, irrespective of the product being supported. For example,

  • One of the initial steps of troubleshooting is to look at the log files that contain the error message
  • The support engineer might then want to check what happened before and after a particular error message in the log file that contains it
  • The support engineer might want to see what happened in other files (which represent the other systems/processes of the product) at the time of the error. The events surrounding an event of interest might shed light on what went wrong
  • The support engineer might also want to look at the output of specific commands, represented by different sections in the log file. These sections could represent the configuration of the device, the state of the system as a whole or specific parts of the system
  • More often than not, a problem is due to a change in the system's configuration. The support engineer would want to know what changed and when
  • Depending on the type of problem, the support engineer might want to dig into performance or other statistical trends being tracked for that system
  • Before digging deeper into the logs to solve the issue, the support engineer might also want to check whether this is an isolated event or is prevalent across multiple systems in the field
  • A product never works in isolation and is always interconnected with other devices in a stack-like environment. Support issues are often not isolated incidents, but depend on other systems in the stack too. The support engineer in this case needs to analyze across the stack
  • The support engineer might want to check whether this is a previously solved problem, so that he doesn't reinvent the wheel. He could do this by going through previous support cases and/or knowledge base articles that have a solution for this problem
  • If this is a performance-related problem, the support engineer would typically collect performance statistics, plot them and analyze trends. The engineer would also like to know what was going on in the system when the performance went down, what other events occurred, which configuration changed, etc.
  • What if this is a known bug? The support engineer would have to check the bug database to find out; otherwise, the engineer would waste time rediscovering a known problem
  • What if this is a known issue, but not formally documented anywhere? In most organizations, there is a wealth of information being discussed on internal e-mail distribution lists which doesn't get documented anywhere. So, the support engineer might search his inbox to see if he finds anything there
  • What about those cheat sheets that each support engineer has built? Those are details known only to a specific engineer, who has not yet found the time to document them anywhere. The support engineer might check whether the issue at hand matches anything he remembers having seen or solved before
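
Two of the steps above, looking at what happened around an error message and finding out what changed in the configuration, are simple enough to sketch in a few lines of Python. This is only a rough illustration using the standard library; the function names, log lines and configuration keys are made up for the example.

```python
import difflib


def context_around(lines, needle, before=2, after=2):
    """Return (line_number, text) for each line containing `needle`,
    together with the requested number of lines before and after it."""
    wanted = set()
    for index, line in enumerate(lines):
        if needle in line:
            start = max(0, index - before)
            stop = min(len(lines), index + after + 1)
            wanted.update(range(start, stop))
    return [(i + 1, lines[i]) for i in sorted(wanted)]


def config_changes(old_config, new_config):
    """Return only the removed ('- ') and added ('+ ') lines between two
    configuration snapshots, dropping unchanged lines and diff hints."""
    diff = difflib.ndiff(old_config, new_config)
    return [entry for entry in diff if entry.startswith(("+ ", "- "))]


# Made-up log and configuration snapshots for illustration.
log = ["a start", "b load config", "c ERROR 5085 timeout", "d retry", "e ok", "f idle"]
window = context_around(log, "ERROR", before=1, after=1)

old = ["hostname=node1", "mtu=1500", "ntp=10.0.0.1"]
new = ["hostname=node1", "mtu=9000", "ntp=10.0.0.1"]
changes = config_changes(old, new)

print(window)
print(changes)
```

Tracking "when" a configuration changed would additionally need a timestamp per snapshot, which is left out here.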

There could be more steps than those listed above, but you get the picture. While not all issues require detailed troubleshooting, even simple issues or known problems require time spent by support on the process side. For example, even if it is a well-defined, known issue, the support engineer still has to spend time opening a case and updating all the details in the case, including the solution. If it is an RMA, a linked case has to be opened to dispatch a replacement part. If the support team receives hundreds of such cases, a significant amount of time is spent even on known issues.

Support needs tools that help perform the above steps faster and tools that can help automate where possible. In the next part of this series, we will look at a platform and associated tools that can help support groups. 

Thursday, May 23, 2013

To Search or Not to Search

Search is a great starting point for any analytics on machine data. As a user, you initially don't know what you are searching for, and hence searching for a "needle in a haystack" is easy, because all you need to do is type needle! Yes, you will get a lot of results back, which then need to be filtered, ranked and presented in a meaningful way, but open source search engines that allow full-text search of any document, like Solr/Lucene, provide a good starting point for a search implementation. With the data indexed, a lot of functionality can be built on top of it to enhance the search capability for the user.

While search is good as a starting point, it tends to get more geeky for the end user when complex search queries need to be created. Even if you manage to write complex queries, everything driven from search alone means a huge load on the search server. Also, the power of full text search diminishes when you want to perform operations similar to those on structured data.

Consider a simple example of a device log that has CPU load information captured as a table by the Linux vmstat command, as shown below. This is just one section among a whole lot of other information in the log.

 r  b    swpd    free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
 8  1 1865144 1018628  56364 1677296   10   16    35   161   97   92  4  0  2  1  0
 2  2 1865144 1018628  56364 1677296    0    0     0     4 1007  395  0  0 30  0  0
 4  2 1865144 1018628  56376 1677296    0    0     0    92 1016  421  0  0 10  0  0
 4  2 1865144 1018628  56376 1677296    0    0     0     0 1008  363  0  0 10  0  0
 6  3 1865144 1018628  56376 1677316    0    0     0     0 1003  361  0  0  5  0  0
 3  1 1865144 1018628  56384 1677308    0    0     0    20 1006  341  0  0 10  0  0
 1  1 1865144 1018628  56384 1677332    0    0     0     8 1009  364  0  0 50  0  0
 0  0 1865144 1018628  56396 1677320    0    0     0    92 1019  448  0  0 99  0  0

Let's say we need to plot r and b over time, and also look at the CPU idle percentage (id) whenever r or b is greater than 2.

A requirement like this is complex, if not impossible, when approaching the problem through search. Let's assume instead that the data above was parsed and made available in a database-like table. Now, performing the analysis listed above would be relatively simple. Also, the user can be given a simple user interface to browse through the attributes that have been extracted, with a drag-and-drop interface to slice and dice the data. This approach, though, requires upfront data extraction, which has to be persisted in a data store, and a simple framework to do it efficiently.
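
As a small illustration of why parsed data makes this easy, the Python sketch below turns the vmstat section above into rows of named columns and then answers the question directly. It assumes the section has already been cut out of the larger log, and since the sample has no timestamps it uses the sample position as a stand-in for time.

```python
VMSTAT_SECTION = """\
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
8  1 1865144 1018628  56364 1677296   10   16    35   161   97   92  4  0 2  1  0
2  2 1865144 1018628  56364 1677296    0    0     0     4 1007  395  0  0 30  0  0
4  2 1865144 1018628  56376 1677296    0    0     0    92 1016  421  0  0 10  0  0
4  2 1865144 1018628  56376 1677296    0    0     0     0 1008  363  0  0 10  0  0
6  3 1865144 1018628  56376 1677316    0    0     0     0 1003  361  0  0 5  0  0
3  1 1865144 1018628  56384 1677308    0    0     0    20 1006  341  0  0 10  0  0
1  1 1865144 1018628  56384 1677332    0    0     0     8 1009  364  0  0 50  0  0
0  0 1865144 1018628  56396 1677320    0    0     0    92 1019  448  0  0 99  0  0
"""


def parse_vmstat(section):
    """Turn a whitespace-separated vmstat table into a list of dicts,
    one per sample, keyed by the column names in the header line."""
    lines = [line for line in section.splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        values = [int(value) for value in line.split()]
        rows.append(dict(zip(header, values)))
    return rows


rows = parse_vmstat(VMSTAT_SECTION)

# Series that could be handed to any plotting library.
r_series = [row["r"] for row in rows]
b_series = [row["b"] for row in rows]

# CPU idle percentage for the samples where r or b is greater than 2.
busy_idle = [row["id"] for row in rows if row["r"] > 2 or row["b"] > 2]

print(r_series)
print(b_series)
print(busy_idle)
```

Once the rows exist in this form, the same query is a one-line filter, whereas expressing it as a full-text search over the raw log is awkward at best.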

The question hence is not about analytics through search versus analytics through pre-parsed content. It is about using the right method based on the type of content, so that effective results can be seen with minimal cost, time and resources. The question then is, why not both? When the data is both parsed and indexed appropriately, it opens up avenues that are not possible with either approach alone. For example, all the parsed content can be used to add useful context to what is being searched, and search as an interface can be used to get to the content in the simplest and quickest way. You are no longer searching for a pattern, but for a pattern within a well-defined context. For example, consider a search like this: "Error code 5085" where Software version = 5.4 or 5.6, Model = 9877, License = NFS, Language Setting = US-EN. Assume that each of the context variables being passed to search comes from some section in the log, where the attribute of that section can have multiple values. Also, pre-parsed content can be transformed or pre-aggregated to reduce end-user response time for the queries being executed. A combined approach hence could provide a lot more options than an either/or approach.
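
The "search within a context" idea can be sketched in plain Python as follows. This is not how Solr/Lucene or any particular product implements it; it is a toy in-memory version in which each log carries its raw text plus a dictionary of parsed attributes, and the log IDs and attribute names are invented for the example.

```python
def search(logs, text, **context):
    """Return the ids of logs whose raw text contains `text` and whose parsed
    attributes match every context filter. Each filter is a set of accepted
    values; a log attribute may itself hold several values."""
    hits = []
    for log in logs:
        if text.lower() not in log["raw"].lower():
            continue
        attrs = log["parsed"]
        if all(set(attrs.get(key, ())) & set(accepted)
               for key, accepted in context.items()):
            hits.append(log["id"])
    return hits


# Hypothetical logs: raw text plus attributes extracted by a parser.
LOGS = [
    {"id": "log-1", "raw": "10:01 Error code 5085 while mounting volume",
     "parsed": {"software_version": ["5.4"], "model": ["9877"],
                "license": ["NFS", "CIFS"], "language": ["US-EN"]}},
    {"id": "log-2", "raw": "11:20 Error code 5085 while mounting volume",
     "parsed": {"software_version": ["5.2"], "model": ["9877"],
                "license": ["NFS"], "language": ["US-EN"]}},
    {"id": "log-3", "raw": "12:45 backup completed",
     "parsed": {"software_version": ["5.6"], "model": ["9877"],
                "license": ["NFS"], "language": ["US-EN"]}},
]

hits = search(
    LOGS,
    "Error code 5085",
    software_version={"5.4", "5.6"},
    model={"9877"},
    license={"NFS"},
    language={"US-EN"},
)
print(hits)
```

Here log-2 contains the error text but fails the software version filter, and log-3 matches the context but not the text, so only log-1 is returned.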

At Glassbeam, we are constantly looking at providing practical solutions to Big data problems. The above blog is just one such use case.

Social Responsibility begins at home.... and office too!

Vishwanath works as a front desk/security supervisor at our Glassbeam India office. Those of us who work in India would know that the salary he earns is barely enough to meet basic living needs. He has a daughter who finished her 10th standard and got her results a couple of days back. She cleared the exams with flying colours, scoring 88.5% overall, with 98 in Math, 98 in Social Science, 94 in English, 100 in Kannada (out of 125), 86 in Hindi and 77 in General Science. While it is an amazing achievement on its own, the following factors make it even better.

1.   Her background - For many of us who come from a background of financial stability or well-educated parents/relatives/siblings, it's less difficult to score well than it is for children who lack the basic background they need to do well. Kids who do well in such a scenario deserve multi-fold applause for their achievement.
2.   She studied in Kannada medium till 8th standard - She moved to English medium only from 9th, which means that not only did she learn English in just 2 years (she scored 94 in English), but she also learnt all other subjects in English and fared really well too. This is not a small achievement by any means. Those of us who studied in our mother tongue for most of our schooling and switched to English medium later would know that this is indeed commendable.

She was planning to join one of the top colleges in Bangalore for her pre-university. The college fees were Rs 30,000, plus other expenses for books etc. Vishwanath had managed to arrange Rs 10,000 and didn't know what to do for the rest of the money. It was heart-warming to see how quickly the whole team contributed the remaining amount, as soon as they got to know about it, to help her get admitted to the college. She joined the college of her choice and wants to study Engineering once she completes her 2nd PU (12th Standard in Karnataka). She is really happy and excited about her college. Vishwanath is happy that the next generation in his family will be much better educated and more financially stable than him.

Social responsibility doesn't always have to mean doing something outside. It can start right where you work. Social responsibility, in simple words, is about being sensitive to the issues around you and taking some action to find solutions or mitigate those issues. For those of us who have crossed the barrier and stabilized ourselves socially and economically, it is a simple action to help others around us cross.

Glassbeam recognized by CRN and TiECON

Last week was an eventful week for Glassbeam.  We were recognized by the industry with two distinct awards.  The first was the nomination to the CRN Big Data 100 list.  The second was the selection at TieCon 2013 as one of the winners of the Big Data Lightning Round.  Srikanth Desikan, our fearless VP Products & Marketing, presented the Glassbeam story on machine data analytics in a truly lightning style in less than 3 minutes!  We will be posting the video of this presentation soon on our website.

Getting these awards is a significant validation for us as a company and, most importantly, great recognition for the hard work being put in by the team across US and India time zones.  There are some really interesting updates coming up on our customer wins, product roadmap and outbound events in the coming weeks.  So stay tuned and enjoy your long Memorial Day weekend!

Tuesday, May 7, 2013

To Reduce or not to Reduce

“Big Data” has become what dot com used to be in the late nineties. Anyone having a cute little elephant as their mascot is getting “big bucks”. Yes, Hadoop is the buzz word around town.

At a high level, Hadoop allows you to distribute data across multiple nodes so that you can process it in parallel. And that leads us to the other buzz word (or two) – MapReduce. It seems intuitive that because of the distribution, you can quickly “look” for data in parallel and once you find it, you can then combine it to do something useful with it. First part is Map and the second part is Reduce.

It also seems intuitive that no matter how much you distribute Map, Reduce has to be a linear process. In fact, MapReduce, if used incorrectly, can be a very slow process that is made fast only by throwing a bunch of nodes at it to process the data in parallel. Now, when it comes to applying this philosophy to Machine Log Analytic applications, I have 2 problems with it:

1.      The Map process essentially looks for data based on some rules (regexes or such). Even if it is distributed, it is an expensive operation. For a given analytic query, why repeat the same expensive operation over and over again? It makes sense to preserve the rules for looking for data, and persist the data that has already been looked up.
2.      Why have a slow process in the overall architecture at all? Wouldn’t it be nice to Map but not Reduce?

The solution lies in having a Domain Specific Language (DSL) that allows the rules to be defined easily. Not only that, make it stricter (more like a language) than mere configuration parameters, so that even data accuracy assertions can be built into it.

Use something like an Akka Actor framework which distributes seamlessly across nodes. If this sounds very much like Map – it is.

Make this an asynchronous peer to peer framework, with no master/slave relationship. Empower the actors to do their job and not have to “report” back. Guess what, you just eliminated the linear and time consuming part of MapReduce. You have Map and NO REDUCE. The second benefit of “No Reduce” is that this framework can now scale to a high number of nodes, limited only by the physical limitations of cluster sizes and interconnects.

The final piece of the puzzle is to persist the data already looked up. Use a data store like Cassandra, which has no master/slave relationship, and have the actors directly deposit the “looked” data into Cassandra. Since Cassandra is a peer-to-peer cluster, asynchronous actors can deposit data asynchronously to the nodes visible to them.
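
The "Map and no Reduce" shape described above can be mimicked in a few lines of Python. To be clear, this is not Akka and not Cassandra: a standard-library thread pool stands in for the actors, a small in-memory class stands in for the data store, and the parsing rule and log lines are invented. The only point is that each worker deposits what it finds directly into the store and nothing is gathered back for a combining step.

```python
import re
import threading
from concurrent.futures import ThreadPoolExecutor


class ParsedStore:
    """In-memory stand-in for a peer-to-peer data store (not a real Cassandra client)."""

    def __init__(self):
        self._rows = []
        self._lock = threading.Lock()

    def deposit(self, row):
        with self._lock:
            self._rows.append(row)

    def rows(self):
        with self._lock:
            return list(self._rows)


# Hypothetical parsing rule; in the text above this would come from the DSL.
ERROR_RULE = re.compile(r"ERROR code=(?P<code>\d+)")


def parse_chunk(chunk_id, lines, store):
    """'Map' step: apply the rule and deposit each match directly.
    Nothing is returned, so there is no result to reduce afterwards."""
    for line in lines:
        match = ERROR_RULE.search(line)
        if match:
            store.deposit({"chunk": chunk_id, "code": int(match.group("code"))})


# Made-up log chunks, one per imaginary node.
chunks = {
    "node-a": ["INFO ok", "ERROR code=5085 timeout"],
    "node-b": ["ERROR code=17 disk", "ERROR code=5085 timeout"],
}

store = ParsedStore()
with ThreadPoolExecutor(max_workers=2) as pool:
    for chunk_id, lines in chunks.items():
        pool.submit(parse_chunk, chunk_id, lines, store)

deposited = sorted((row["chunk"], row["code"]) for row in store.rows())
print(deposited)
```

Later queries would read the already parsed rows from the store instead of re-running the expensive parsing step.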

The Glassbeam solution achieves all three. It provides a robust DSL, a highly scalable actor framework and a Cassandra-based data store which contains pre-parsed data as well as raw data for subsequent incremental processing.

Stay tuned for more on incremental processing …..

Thursday, April 25, 2013

Maximizing the value of machine data

There are many ways to analyze machine logs and machine data. Most companies start out with simple manual/semi-automated ways to understand logs. The least sophisticated way is for individual users who are responsible for support or system administration to use standard Windows or Unix tools to search for strings, find interesting sections within logs, etc. Tools and applications that enable sophisticated search on logs have empowered sysadmins and dramatically increased their productivity.

But machine data contains a lot of strategic value, not just "searchable" events. Simple tools and search-based applications are extremely useful for fast troubleshooting, but when an enterprise is looking to analyze machine log data strategically, a multi-dimensional application that can not only search but also structure the data for deeper analytics is imperative.

With Glassbeam, companies can realize immediate value for a specific user or a department using Glassbeam search. At the same time, companies can realize higher strategic value by using Glassbeam's tools and applications to parse and model unstructured data for deeper analytics. The initial setup time is higher compared to generic log tools, but the longer-term value through Glassbeam far outweighs the costs.

Contact us for a complete ROI for your specific needs.