Demystifying ITOA - II

POSTED BY : Data Scientist
Thursday, August 6, 2015

In my last post I introduced ITOA and placed it in the context of APM and ITSM. This time I want to use examples to illustrate the kind of analytics we are aspiring to. I'm going to start with a simple "BI-using-Excel" type of analysis and then go from there.

IT Business Intelligence

In a hypothetical IT environment, Suresh the Service Delivery Manager is looking at the number of incidents that occurred on his servers in the last three months.

He graphs the incident counts by server:

Hmm... this is not very enlightening. Next, he tries grouping these servers according to the application they are serving:

Now he knows which applications are getting hit by how many incidents, but he still doesn't have actionable information.
Finally, Suresh groups the incidents according to the technology involved.

Bingo! It is clear that the servers hosting MySQL have been hit by the most incidents in this time period. With this information Suresh digs deeper, discovers that MySQL has been consistently misconfigured in his environment, and gets his engineers to fix the problem.

This first example might seem extremely basic, and I don't even know if the ITOA folks consider this "real ITOA" or just good old BI. But I wanted to emphasize that a lot can potentially be gained even from a very simple analysis.
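
To make the three groupings concrete, here is a minimal sketch in Python. It is only an illustration of the idea, not what Suresh did in Excel: the incident records, server names, application names and the count_by helper are all made up for this example.

```python
from collections import Counter

# Hypothetical incident records: (server, application, technology).
# None of these values come from a real environment.
incidents = [
    ("srv01", "billing", "MySQL"),
    ("srv02", "billing", "Apache"),
    ("srv03", "crm", "MySQL"),
    ("srv04", "crm", "Tomcat"),
    ("srv05", "portal", "MySQL"),
    ("srv06", "portal", "Apache"),
    ("srv07", "reports", "MySQL"),
]

def count_by(records, field):
    """Count incidents grouped by one field of the record tuple."""
    index = {"server": 0, "application": 1, "technology": 2}[field]
    return Counter(record[index] for record in records)

by_server = count_by(incidents, "server")            # flat: nothing stands out
by_application = count_by(incidents, "application")  # still fairly even
by_technology = count_by(incidents, "technology")    # one technology dominates

print(by_technology.most_common())
```

With this toy data, only the grouping by technology produces a clear outlier, which mirrors Suresh's experience: the data was the same all along, and the choice of grouping key is what made it actionable.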

Using Big Data Tools - Log Analysis

Nilofer manages her client's website, a Ruby on Rails application running on top of Apache and nginx with MongoDB at the back. Her team has recently deployed the ELK stack for real-time log analysis, and Nilofer is eager to discover patterns that may allow her to predict incidents before they occur. The team has noticed that CPU usage on the MongoDB server spikes every hour on the hour, for around 10 minutes, before settling back down again. Each time an engineer investigated, they saw that the mongod daemon was responsible for the increased usage.

So Nilofer decides to look in the database logs using Kibana's visualization capabilities. She starts by looking for warning and error messages, but the timing of these messages doesn't coincide with the spikes at all. So she looks at the logs in the vicinity of the spikes instead, and finds that a certain query is getting fired repeatedly, every hour on the hour. She alerts the development team, which discovers that a bug in their JavaScript is triggering this query from every connected client device; the fix thankfully turns out to be straightforward and is tested and deployed the same day.

This example is still only borderline-ITOA. Nilofer used a Big Data log-analysis tool but didn't perform any analytics per se.

Furthermore, the connection between CPU spikes and query spikes was made manually, with the tool acting purely as an enabler.
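
For comparison, the step Nilofer performed by eye could be approximated in a few lines of Python. This is a sketch under assumptions, not the ELK or Kibana API: the timestamps, query names and the queries_in_window helper are invented, and the 10-minute window is taken from the duration of the CPU spikes described above.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (timestamp, query name) pairs, as might be extracted from
# the database logs. The names and times are invented for illustration.
log_entries = [
    ("2015-08-03 09:00:05", "refresh_feed"),
    ("2015-08-03 09:00:40", "refresh_feed"),
    ("2015-08-03 09:01:10", "refresh_feed"),
    ("2015-08-03 09:17:22", "load_profile"),
    ("2015-08-03 09:42:03", "search"),
    ("2015-08-03 10:00:02", "refresh_feed"),
    ("2015-08-03 10:00:31", "refresh_feed"),
    ("2015-08-03 10:02:15", "refresh_feed"),
    ("2015-08-03 10:25:47", "load_profile"),
]

def queries_in_window(entries, window_minutes=10):
    """Count queries that fall in the first `window_minutes` of any hour."""
    counts = Counter()
    for stamp, query in entries:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if when.minute < window_minutes:
            counts[query] += 1
    return counts

spike_queries = queries_in_window(log_entries)
print(spike_queries.most_common(1))
```

A query that dominates the spike window but is rare elsewhere is a natural suspect. Note that the window still has to be chosen by someone who already knows when the CPU spikes happen, so the correlation itself remains a human insight.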


Gokul has implemented an algorithm for predicting incidents in his company's IT environment. This algorithm takes as its contextual inputs:

- Hardware performance counters across their servers
- Performance counters for software: OS, middleware, and applications
- Log messages generated by applications
- Syslogs from OS daemons and network devices

He does not have access to network traffic, nor can he look inside the JVMs of the applications - after all, he needs to demonstrate benefits to his Operations team before asking for more investments from them.

His approach is straightforward to describe:
1. Baseline the behavior of each performance counter individually
2. If and when any of these counters deviates from its norm:
   a. The algorithm looks for Errors and Warnings in log messages.
   b. The deviations and problematic log messages are provided to a pattern matching engine.
   c. If the engine recognizes this behavior, it looks up the associated incidents from the ticket history.
   d. This list of historical incidents is then provided to his IT Operations team through a dashboard.

Now his operations team has an advance warning system as well as a head start on diagnosis.
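
Steps 1, 2 and 2a above can be sketched as follows. This is a minimal illustration and not Gokul's actual algorithm: it assumes the simplest possible baseline (mean and standard deviation of past readings) and a threshold of three standard deviations, and the counter names, readings and log lines are all hypothetical.

```python
from statistics import mean, stdev

def baseline(history):
    """Summarize a counter's normal behavior as (mean, standard deviation)."""
    return mean(history), stdev(history)

def deviates(value, base, threshold=3.0):
    """True if value is more than `threshold` standard deviations from the mean."""
    mu, sigma = base
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical history and latest reading for two counters.
history = {
    "db01.cpu_pct": [20, 22, 19, 21, 20, 23, 21],
    "db01.disk_io": [100, 110, 95, 105, 102, 98, 104],
}
latest = {"db01.cpu_pct": 85, "db01.disk_io": 103}

# Step 1: baseline each counter individually.
baselines = {name: baseline(values) for name, values in history.items()}

# Step 2: find counters that deviate from their norms.
anomalies = [name for name, value in latest.items()
             if deviates(value, baselines[name])]

# Step 2a: only when something deviates, collect errors and warnings.
log_lines = [
    "INFO connection accepted",
    "WARNING slow query detected",
    "ERROR replication lag exceeded",
]
problem_logs = []
if anomalies:
    problem_logs = [line for line in log_lines
                    if line.startswith(("ERROR", "WARNING"))]

print(anomalies, problem_logs)
```

The anomalous counters and the problematic log messages together form the input that would be handed to the pattern matching engine in step 2b.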

Some comments on algorithms:
- There are simple methods and complex methods for baselining. You should use the simplest method you can get away with (i.e., the simplest one your data will allow).
- The same statement applies to pattern matching. Note that you want some flexibility in your matching algorithm - the whole idea is that it recognizes similar behavior as being similar!
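
One simple way to get that flexibility is to compare sets of symptoms with a similarity score instead of demanding an exact match. The sketch below uses Jaccard similarity, which is my choice for illustration and not necessarily what Gokul uses; the symptom labels, ticket IDs, the 0.5 cutoff and the match_incidents helper are all hypothetical.

```python
def jaccard(a, b):
    """Similarity of two sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical known patterns: a set of symptoms and the past tickets
# in which that combination was seen.
known_patterns = [
    ({"cpu_high", "slow_query_warning", "replication_lag_error"},
     ["INC-101", "INC-140"]),
    ({"disk_full", "write_failed_error"},
     ["INC-087"]),
]

def match_incidents(symptoms, patterns, min_similarity=0.5):
    """Return ticket IDs of every pattern at least `min_similarity` similar."""
    matches = []
    for pattern_symptoms, tickets in patterns:
        if jaccard(symptoms, pattern_symptoms) >= min_similarity:
            matches.extend(tickets)
    return matches

# Two of the three symptoms of the first pattern are present.
observed = {"cpu_high", "slow_query_warning"}
related = match_incidents(observed, known_patterns)
print(related)
```

Here the observed symptoms are not identical to any stored pattern, yet the first pattern is still recognized because two of its three symptoms are present. Tuning the cutoff trades false alarms against missed matches.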
And some comments on deployment:
- One thing I like about this example is that it doesn't try to do "too much". One would certainly want to develop it to the point where it does automated root cause analysis (RCA) and remediation. But it has utility even before it reaches that stage!


I just described three hypothetical examples of how analyzing operations data can benefit the management of IT operations:

1. Using "small data" with Excel,
2. Using a sophisticated tool which enables an intelligent human to notice patterns,
3. Using an automated pattern detection system.

My next blog post will be motivated by Example (2), and will look at the utility of having the right visualization tools in the hands of the right people. After that I will begin a discussion on machine learning for solving IT Operations challenges, and how to architect an ITOA platform to plug in these clever algorithms.

Until next time!

Rohit Chatterjee
Data Scientist
Business Strategy