Troubleshooting troubles!
This is a 4-part series focusing on the use case for, and a possible solution to, supporting support engineers.
The problem statement
Having done tech support in the past, and now talking to clients and prospects about supporting their support teams, I am always struck by how different the support process is from one company to another. The products being supported are different, the logs for each device are different, what you look for in the logs is different, and in general everything differs from one product's support team to another. Moreover, how one troubleshoots a problem varies greatly from one person to another, even for the same product. This raises an interesting question in the machine-data-based support space: is support automation even possible?
Support automation in the form of a single predefined workflow tool that works for all support groups ranges from complex to impractical. While some aspects of troubleshooting can be standardized, most of it will be product-specific or individual. What support needs are tools that make troubleshooting simpler and faster, tools that can be customized to each individual's needs, and tools that can be programmed to bubble up all known issues and automate the related support processes.
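To make "programmed to bubble up all known issues" a little more concrete, here is a minimal sketch, assuming known issues can be described as regular-expression signatures matched against individual log lines. The signature table, the IDs `BUG-101` and `KB-2040`, and the `bubble_up_known_issues` helper are all hypothetical illustrations, not part of any specific product.

```python
import re
from dataclasses import dataclass


@dataclass
class KnownIssue:
    # Hypothetical known-issue record: an ID (e.g. a bug or KB reference),
    # a regex signature to look for in log lines, and a short resolution hint.
    issue_id: str
    signature: re.Pattern
    resolution: str


# Hypothetical signature table; each support team would maintain its own.
KNOWN_ISSUES = [
    KnownIssue("BUG-101", re.compile(r"fan \d+ speed below threshold"), "Replace fan tray (RMA)."),
    KnownIssue("KB-2040", re.compile(r"license .* expired"), "Renew the license key."),
]


def bubble_up_known_issues(lines, known_issues=KNOWN_ISSUES):
    """Return (line_number, issue) pairs for every log line matching a known signature."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for issue in known_issues:
            if issue.signature.search(line):
                hits.append((lineno, issue))
    return hits


if __name__ == "__main__":
    log = [
        "2024-03-01 10:00:00 INFO boot complete",
        "2024-03-01 10:05:12 ERROR fan 2 speed below threshold",
        "2024-03-01 10:06:00 WARN license feature-x expired",
    ]
    for lineno, issue in bubble_up_known_issues(log):
        print(f"line {lineno}: {issue.issue_id} -> {issue.resolution}")
```

Keeping the signatures as data rather than code is what lets each engineer or team add their own known issues without touching the matching logic.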
If one takes a step back and looks at troubleshooting as a process across product lines, many commonalities stand out, irrespective of the product being supported. For example:
· One of the initial steps of troubleshooting is to look at the log files that contain the error message.
· The support engineer might then want to check what happened before and after a particular error message in the log file that contains it.
· The support engineer might want to see what happened in other files (which represent the other systems or processes of the product) around the time of the error. The events surrounding an event of interest might throw light on what went wrong.
· The support engineer might also want to look at the output of specific commands, represented by different sections in the log file. These sections could represent the configuration of the device, the state of the system as a whole, or the state of specific parts of the system.
· More often than not, a problem is due to a change in the system's configuration. The support engineer would want to know what changed, and when.
· Depending on the type of problem, the support engineer might want to dig into performance or other statistical trends being tracked for that system.
· Before digging deeper into the logs, the support engineer might also want to check whether this is an isolated event or one that is prevalent across multiple systems in the field.
· A product never works in isolation; it is always interconnected with other devices in a stack-like environment. Support issues are often not isolated incidents but depend on other systems in the stack as well. In such cases, the support engineer needs to analyze across the stack.
· The support engineer might want to check whether this is a previously solved problem, so that he doesn't reinvent the wheel. He could do this by going through previous support cases and/or knowledge base articles that contain a solution for this problem.
· If this is a performance-related problem, the support engineer would typically collect performance statistics, plot them, and analyze trends. The engineer would also like to know what was going on in the system when performance went down: what other events occurred, which configuration changed, and so on.
· What if this is a known bug? The support engineer would then have to check the bug database to find out; otherwise, the engineer could waste time diagnosing a problem that is already known.
· What if this is a known issue that is not formally documented anywhere? In most organizations, a wealth of information is discussed on internal e-mail distribution lists and never gets documented. So the support engineer might search his inbox to see if he finds anything there.
· What about the cheat sheets that each support engineer has built? They contain details known only to a specific engineer, who has not yet found the time to document them anywhere. The support engineer might check whether the issue at hand matches anything he remembers having seen or solved before.
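Several of the steps above (finding the error, looking at what happened before and after it, checking other files around the same time, and spotting configuration changes) are mechanical enough to script. Below is a minimal Python sketch, assuming a hypothetical log format in which every line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, and assuming configuration snapshots are available as lists of lines. The helper names `error_context`, `events_near`, and `config_changes` are illustrative only, not part of any particular tool.

```python
import difflib
from datetime import datetime, timedelta

# Assumed (hypothetical) log format: "YYYY-MM-DD HH:MM:SS LEVEL message"
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_timestamp(line):
    """Parse the leading 19-character timestamp of a log line, or return None."""
    try:
        return datetime.strptime(line[:19], TS_FORMAT)
    except ValueError:
        return None


def error_context(lines, keyword="ERROR", before=2, after=2):
    """Yield (index, context_lines) for each line containing `keyword`,
    with up to `before` lines before and `after` lines after it (like grep -B/-A)."""
    for i, line in enumerate(lines):
        if keyword in line:
            start = max(0, i - before)
            yield i, lines[start:i + after + 1]


def events_near(files, when, window=timedelta(seconds=30)):
    """Collect lines from other log files whose timestamp is within +/- `window`
    of `when`. `files` maps a file name to its list of lines; results are time-sorted."""
    nearby = []
    for name, lines in files.items():
        for line in lines:
            ts = parse_timestamp(line)
            if ts is not None and abs(ts - when) <= window:
                nearby.append((ts, name, line))
    return sorted(nearby)


def config_changes(old_config, new_config, old_label="config@before", new_label="config@after"):
    """Return a unified diff (list of lines) between two configuration snapshots."""
    return list(difflib.unified_diff(old_config, new_config,
                                     fromfile=old_label, tofile=new_label, lineterm=""))


if __name__ == "__main__":
    app_log = [
        "2024-03-01 10:05:11 INFO calling storage",
        "2024-03-01 10:05:12 ERROR write timeout",
        "2024-03-01 10:05:13 INFO retrying",
    ]
    other_logs = {
        "storage.log": ["2024-03-01 10:04:00 INFO healthy",
                        "2024-03-01 10:05:05 WARN disk latency high"],
        "network.log": ["2024-03-01 10:20:00 INFO link up"],
    }
    for idx, context in error_context(app_log, before=1, after=1):
        print("\n".join(context))
        for ts, name, line in events_near(other_logs, parse_timestamp(app_log[idx])):
            print(f"  nearby in {name}: {line}")
    print("\n".join(config_changes(["mtu 1500"], ["mtu 9000"])))
```

The time window in `events_near` is the main tuning knob: too narrow and related events in other subsystems are missed, too wide and the output drowns in unrelated noise.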
There could be more steps than those listed above, but you get the picture. While not all issues require detailed troubleshooting, even simple issues or known problems take up support's time on the process side. For example, even if it is a well-defined, known issue, the support engineer still has to spend time opening a case and filling in all the details, including the solution. If it is an RMA, a linked case has to be opened to dispatch a replacement part. If the support team receives hundreds of such cases, a significant amount of time is spent even on known issues.
Support needs tools that help perform the above steps faster and that automate them where possible. In the next part of this series, we will look at a platform and associated tools that can help support groups.