Wednesday, June 19, 2013

Troubleshooting troubles! - Part 1

This is a four-part series on the use case for, and a possible solution to, supporting support engineers.

The problem statement

Having done tech support in the past, and now talking to clients and prospects about supporting their support teams, I am always struck by how different the support process is from one company to another. The products being supported are different, the logs for each device are different, and what you look for in those logs is different; in general, everything differs from one product's support team to another. Moreover, how one troubleshoots a problem varies from person to person, even within the same product. This raises an interesting question in the machine-data-based support space: is support automation even a possibility?

Support automation as a single predefined workflow tool that works for all support groups ranges from complex to impractical. While some aspects of troubleshooting can be standardized, most of it will be product-specific or individualistic. What support needs are tools that make troubleshooting tasks simpler and faster, tools that can be customized to each individual's needs, and tools that can be programmed to bubble up all known issues and automate related support processes.

If one takes a step back and looks at troubleshooting as a process across product lines, many common things stand out, irrespective of the product being supported. For example,

·         One of the first steps in troubleshooting is to look at the log files that contain the error message
·         The support engineer might then want to check what happened before and after a particular error message in that log file
·         The support engineer might want to see what happened in other files (which represent the other systems/processes of the product) around the time of the error. The events surrounding an event of interest might shed light on what went wrong
·         The support engineer might also want to look at the output of specific commands, represented by different sections in the log file. These sections could represent the configuration of the device, the state of the system as a whole, or the state of specific parts of the system
·         More often than not, a problem is due to a change in the system's configuration. The support engineer would want to know what changed and when
·         Depending on the type of the problem, the support engineer might want to dig into performance or other statistical trends which are being tracked for that system
·         Before digging deeper into the logs to solve the issue, the support engineer might also want to check if this is an isolated event or is prevalent across multiple systems in the field.
·         A product never works in isolation and is always interconnected with other devices in a stack-like environment. Support issues are often not isolated incidents; they depend on other systems in the stack too. In this case, the support engineer needs to analyze across the stack.
·         The support engineer might want to check if this is a previously solved problem, so that he doesn't reinvent the wheel. He could do this by going through previous support cases and/or knowledge base articles that have a solution for this problem
·         If this is a performance-related problem, the support engineer would typically collect performance statistics, plot them and analyze trends. The engineer would also like to know what was going on in the system when the performance went down, what other events occurred, which configuration changed, etc.
·         What if this is a known bug? The support engineer would then have to check the bug database to see whether it is a known bug; otherwise, the engineer would waste time rediscovering a known problem.
·         What if this is a known issue that is not formally documented anywhere? In most organizations, a wealth of information is discussed on internal email distribution lists and never gets documented anywhere. So, the support engineer might search his inbox to see if he finds anything there
·         What about those cheat sheets that each support engineer has built? Those hold details known only to a specific engineer, who has not yet found the time to document them anywhere. The support engineer might check if the issue at hand matches anything he remembers having seen or solved before.
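
A few of the steps above (finding an error with its surrounding lines, pulling events from other files around the same time, and spotting what changed in a configuration) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any particular product: the log line layout ("YYYY-MM-DD HH:MM:SS message") is assumed, and the function names error_context, events_near and config_changes are hypothetical.

```python
import difflib
import re
from datetime import datetime, timedelta

# Assumption: every log line starts with "YYYY-MM-DD HH:MM:SS ".
# Real product logs will differ, and this pattern would need adjusting.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse_line(line):
    """Return (timestamp, message), or None if the line has no timestamp."""
    match = TIMESTAMP.match(line)
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return stamp, match.group(2)


def error_context(lines, pattern, before=2, after=2):
    """Return (index, surrounding lines) for each line matching pattern."""
    regex = re.compile(pattern)
    found = []
    for i, line in enumerate(lines):
        if regex.search(line):
            found.append((i, lines[max(0, i - before): i + after + 1]))
    return found


def events_near(logs, when, window=timedelta(seconds=30)):
    """Return (source, line) pairs from all logs within window of when,
    ordered by timestamp. logs maps a file name to its list of lines."""
    hits = []
    for source, lines in logs.items():
        for line in lines:
            parsed = parse_line(line)
            if parsed is not None and abs(parsed[0] - when) <= window:
                hits.append((parsed[0], source, line))
    hits.sort(key=lambda hit: hit[0])  # order by timestamp only
    return [(source, line) for _, source, line in hits]


def config_changes(old, new):
    """Return only the removed ('-') and added ('+') lines between two
    configuration snapshots, each given as a list of lines."""
    diff = list(difflib.unified_diff(old, new, lineterm="", n=0))
    # The first two entries are the "---"/"+++" file headers; "@@" lines
    # mark hunks and are filtered out by the prefix check.
    return [d for d in diff[2:] if d.startswith(("+", "-"))]
```

In this sketch, an engineer would first call error_context on the file that contains the error, then pass the timestamp of the error to events_near together with the other log files, and finally compare two configuration sections with config_changes. Each piece stays small so that it can be adapted to an individual's way of working.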

There could be more steps than those listed above, but you get the picture. While not all issues require detailed troubleshooting, even simple issues and known problems require support to spend time on the process side. For example, even if it is a well-defined, known issue, the support engineer still has to spend time opening a case and updating all the details in it, including the solution. If it is an RMA, a linked case has to be opened to dispatch a replacement part. If the support team receives hundreds of such cases, a significant amount of time is spent even on known issues.

Support needs tools that help perform the above steps faster and tools that can help automate where possible. In the next part of this series, we will look at a platform and associated tools that can help support groups.