Search is a natural starting point for any analytics on machine data. As a user, you initially don't know exactly what you are looking for, so searching for the "needle in a haystack" is easy: all you need to do is type "needle"! You will get a lot of results back that then need to be filtered, ranked, and presented in a meaningful way, but open source search engines that allow full-text search of any document, such as Solr/Lucene, provide a good starting point for a search implementation. Once the data is indexed, a lot of functionality can be built on top of it to enhance the user's search experience.
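To make this concrete, here is a minimal sketch in Python of what sits at the core of a full-text engine: tokenize each document and build an inverted index mapping every term to the documents that contain it. The sample log lines and the InvertedIndex class are purely illustrative; real engines like Solr/Lucene add analyzers, relevance ranking, and persistence on top of this idea.

import re
from collections import defaultdict

# Toy log lines standing in for indexed machine data (illustrative only).
LOG_LINES = [
    "2013-04-02 10:01:12 INFO  nightly backup completed",
    "2013-04-02 10:02:33 ERROR needle not found in haystack",
    "2013-04-02 10:03:05 WARN  disk usage at 91%",
]

class InvertedIndex:
    """Maps each lowercased term to the ids of the lines that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.lines = []

    def add(self, line):
        doc_id = len(self.lines)
        self.lines.append(line)
        for term in re.findall(r"\w+", line.lower()):
            self.postings[term].add(doc_id)

    def search(self, term):
        return [self.lines[i] for i in sorted(self.postings[term.lower()])]

index = InvertedIndex()
for line in LOG_LINES:
    index.add(line)

# All the user has to do is type "needle".
print(index.search("needle"))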
While search is a good starting point, it quickly becomes too technical for the end user once complex search queries need to be written. Even if you manage to write those queries, driving everything from search alone puts a huge load on the search server. Moreover, the power of full-text search diminishes when you want to perform the kinds of operations you would run on structured data, such as aggregations and joins.
Consider a simple example of a device log that captures CPU load information as a table produced by the Linux vmstat command, as shown below. This is just one section among a whole lot of other information in the log:
 r  b    swpd    free  buff   cache si so bi  bo   in  cs us sy id wa st
 8  1 1865144 1018628 56364 1677296 10 16 35 161   97  92  4  0  2  1  0
 2  2 1865144 1018628 56364 1677296  0  0  0   4 1007 395  0  0 30  0  0
 4  2 1865144 1018628 56376 1677296  0  0  0  92 1016 421  0  0 10  0  0
 4  2 1865144 1018628 56376 1677296  0  0  0   0 1008 363  0  0 10  0  0
 6  3 1865144 1018628 56376 1677316  0  0  0   0 1003 361  0  0  5  0  0
 3  1 1865144 1018628 56384 1677308  0  0  0  20 1006 341  0  0 10  0  0
 1  1 1865144 1018628 56384 1677332  0  0  0   8 1009 364  0  0 50  0  0
 0  0 1865144 1018628 56396 1677320  0  0  0  92 1019 448  0  0 99  0  0
Let's say we need to plot r and b over time, and also look at the CPU idle percentage (the id column) whenever r or b is greater than 2.
A requirement like this is complex, if not impossible, to meet through search alone. Now assume the data above has been parsed and is available in a database-like table: the analysis becomes relatively simple, as sketched below. The user can also be given a simple interface to browse the extracted attributes, with a drag-and-drop way to slice and dice the data. This approach, though, requires upfront data extraction, a data store to persist the results, and a framework to do it all efficiently.
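As a rough sketch of why the parsed route is simpler, assume the vmstat section above has already been extracted into rows; the Python below loads them into an in-memory SQLite table and expresses the requirement as an ordinary SQL query. The parsing shortcut and the table layout are assumptions made for illustration, not a description of any particular pipeline.

import sqlite3

# The vmstat rows from above, as a parser might emit them. The "in" column
# (interrupts) is renamed "intr" here because IN is an SQL keyword.
COLUMNS = "r b swpd free buff cache si so bi bo intr cs us sy id wa st".split()
VMSTAT = """\
8 1 1865144 1018628 56364 1677296 10 16 35 161 97 92 4 0 2 1 0
2 2 1865144 1018628 56364 1677296 0 0 0 4 1007 395 0 0 30 0 0
4 2 1865144 1018628 56376 1677296 0 0 0 92 1016 421 0 0 10 0 0
4 2 1865144 1018628 56376 1677296 0 0 0 0 1008 363 0 0 10 0 0
6 3 1865144 1018628 56376 1677316 0 0 0 0 1003 361 0 0 5 0 0
3 1 1865144 1018628 56384 1677308 0 0 0 20 1006 341 0 0 10 0 0
1 1 1865144 1018628 56384 1677332 0 0 0 8 1009 364 0 0 50 0 0
0 0 1865144 1018628 56396 1677320 0 0 0 92 1019 448 0 0 99 0 0
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vmstat (%s)" % ", ".join("%s INTEGER" % c for c in COLUMNS))
rows = [tuple(int(v) for v in line.split()) for line in VMSTAT.splitlines()]
conn.executemany("INSERT INTO vmstat VALUES (%s)" % ", ".join("?" * len(COLUMNS)), rows)

# The requirement from the text: CPU idle (id) whenever r or b exceeds 2.
for r, b, idle in conn.execute("SELECT r, b, id FROM vmstat WHERE r > 2 OR b > 2"):
    print("r=%d b=%d id=%d%%" % (r, b, idle))

Plotting r and b over time would then simply be a matter of handing the same rows to any charting library.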
The question, then, is not analytics through search versus analytics through pre-parsed content. It is about using the right method for the type of content, so that effective results come at minimal cost, time, and resources. So why not both? When the data is both parsed and indexed appropriately, it opens up avenues that neither approach allows alone. For example, the parsed content can add useful context to what is being searched, while search as an interface remains the simplest and quickest way to get to the content. You are no longer searching for a pattern; you are searching for a pattern within a well-defined context. For example, consider a search like this: "Error code 5085" where Software version = 5.4 or 5.6, Model = 9877, License = NFS, Language Setting = US-EN. Assume that each context variable passed to the search comes from some section of the log, and that the attribute in that section can have multiple values. Pre-parsed content can also be transformed or pre-aggregated to improve end-user response time for the queries being executed. A combined approach could therefore provide far more options than an either/or approach, as the sketch below illustrates.
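Here is a minimal Python sketch of that combined idea, assuming each log has already been parsed into context fields alongside its raw text: the structured fields narrow the candidate set first, and the full-text match for "Error code 5085" runs only within that well-defined context. The field names and sample records are invented for illustration.

# Parsed logs: structured context fields plus the raw text (sample data is made up).
LOGS = [
    {"version": "5.4", "model": "9877", "license": "NFS", "lang": "US-EN",
     "text": "... Error code 5085: NFS mount timed out ..."},
    {"version": "5.6", "model": "9877", "license": "NFS", "lang": "US-EN",
     "text": "... backup completed without errors ..."},
    {"version": "5.2", "model": "1200", "license": "CIFS", "lang": "US-EN",
     "text": "... Error code 5085: NFS mount timed out ..."},
]

def contextual_search(logs, query, **context):
    """Filter by parsed context fields first, then full-text match the query."""
    for log in logs:
        # A constraint may allow multiple values, e.g. version in {5.4, 5.6}.
        if all(log[field] in values for field, values in context.items()):
            if query in log["text"]:
                yield log

hits = contextual_search(
    LOGS, "Error code 5085",
    version={"5.4", "5.6"}, model={"9877"}, license={"NFS"}, lang={"US-EN"},
)
for hit in hits:
    print(hit["version"], hit["model"], "->", hit["text"].strip())

In a real deployment the context constraints would map onto the engine's own field-query mechanism (Lucene supports queries against named fields) rather than a Python loop, but the division of labor is the same: parsed fields define the context, and full-text search finds the pattern inside it.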
At Glassbeam, we are constantly looking at providing practical solutions to Big Data problems. The above is just one such use case.