Machine Learning to the Rescue

Posted by Rohit Chatterjee, Data Scientist
Tuesday, May 19, 2015

Last Time

In the last post, I talked about:

  • Monitoring IT environments
  • Thresholds on normal behavior for individual KPIs
  • False alarms arising from setting thresholds away from the danger zone

The key to reducing false alarms is to set thresholds correctly, which brings us back to our ongoing question: what does "correctly" mean? What is the normal behavior for a KPI?

To appreciate the difficulty with this question, consider the following:

The network traffic to and from a corporate email server is expected to be high at 9am on a business day, but not at 3am on a Saturday night. Normal behavior differs between these two time periods, so they should have different thresholds for their KPIs.

The frequency of disk reads on a database server is expected to be high if there is a corresponding increase in the number of user requests hitting the application being served. But if the application is dormant, we don't expect a lot of disk activity. So the definition of normal depends on the user load, and thresholds should adjust accordingly.

We could therefore replace the fixed, static thresholds for each KPI with thresholds that change with context: with the time of day and the day of the week, and with business activity or activity on connected components.

This idea sounds good, but implementing it naively would be a nightmare. There are just too many KPIs to consider, each with its own behavioral patterns. One would need a small army of engineers just to define and maintain all the different thresholds, and a knowledge management system to preserve that knowledge as people move on.

A better strategy is to use a computer to learn the behavior of each KPI.

Dynamic Thresholds

At Microland, we use neural networks to attack this learning problem. For our purposes, a neural network provides a way to encode an association (a function) between inputs and outputs. More importantly, the neural network can infer this function from a training set of example inputs and outputs.
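To make this concrete, here is a minimal sketch of a network inferring a function from a training set. The library (scikit-learn), the network size, and the toy function are all illustrative choices, not a description of our production setup:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Training set: noisy samples of some unknown function, here f(x) = sin(x).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(0.0, 0.1, size=500)

# A small feed-forward network learns the input-output association.
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=0)
net.fit(X, y)

# The trained network now encodes an approximation of f.
print(net.predict([[2.5]]))  # close to sin(2.5), about 0.60
```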

For each KPI we train a neural network to make a short-term forecast based on recent KPI values. With these forecasts, we can suppress a false alarm if a KPI value crosses a static threshold but the forecast indicates that it will come back within range.

In fact we can do more with these predictions. Here is some low-hanging fruit:

  1. Provide advance warning of alarms when future KPI predictions cross thresholds
  2. Raise an alarm if the KPI's behavior deviates significantly from what was predicted, even if it stays away from the danger zone

We create a window around the forecast, and call its edges the dynamic thresholds for the performance counter.
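Here is a sketch of the whole pipeline on synthetic data: build (recent history, next value) training pairs, fit a small network, and place a band around its one-step forecast. The window length and the 3-sigma band width are assumptions for illustration, not the values we actually use:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

WINDOW = 12  # how many recent samples the network sees (an assumption)

# A synthetic KPI series stands in for real monitoring data here.
rng = np.random.default_rng(1)
t = np.arange(2000)
series = 50 + 10 * np.sin(2 * np.pi * t / 288) + rng.normal(0, 1, t.size)

# Training set: (recent history -> next value) pairs.
X = np.array([series[i:i + WINDOW] for i in range(len(series) - WINDOW)])
y = series[WINDOW:]
net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0).fit(X, y)

# One-step forecast from the latest window, and a band around it sized
# by the typical forecast error on the training data.
forecast = net.predict(series[-WINDOW:].reshape(1, -1))[0]
resid_std = np.std(y - net.predict(X))
dyn_lower, dyn_upper = forecast - 3 * resid_std, forecast + 3 * resid_std
print(f"forecast {forecast:.1f}, dynamic thresholds [{dyn_lower:.1f}, {dyn_upper:.1f}]")
```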

Nota Bene

Neural networks have been in the news a lot lately, especially in the context of deep learning. Our application here is much simpler: we are not looking for patterns which are completely invisible to the human eye; rather, we are automating the tedious task of cataloguing patterns which an observant human might notice. Don't ride the hype cycle! There is no deep learning going on here.


Performance

Let's see how this method performed in a production environment. The following graphs were created as part of a POC recently completed here at Microland. The monitoring data is real production data, as are the incident tickets which were deemed to be false alarms. We determined that for a particular performance counter on a single Windows server, our method would have prevented the creation of 60% of the false alarms raised during the month of February. For this blog post I will cherry-pick two of the better days :) .

In each graph, the yellow line traces the measured value of the performance counter, and the green line indicates its static lower threshold; when the yellow line drops below the static threshold, an alarm is generated and a ticket is cut in our SmartCenter ITSM tool. The red and blue lines are the dynamic thresholds computed by our method; as long as the yellow line stays between the red and blue lines, we declare that the counter is safe.
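In code, the decision rule just described might look like the following sketch (the function and its return labels are hypothetical, for illustration):

```python
def classify(value, static_lower, dyn_lower, dyn_upper):
    """Combine a static lower threshold with a dynamic band around the forecast."""
    within_band = dyn_lower <= value <= dyn_upper
    if value < static_lower and within_band:
        return "suppress"         # forecast says it will come back in range
    if value < static_lower:
        return "cut ticket"       # genuine breach of the static threshold
    if not within_band:
        return "deviation alarm"  # unusual even though outside the danger zone
    return "ok"
```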

False Alarms on Feb 3 2015

False Alarms on Feb 11 2015

Finally, here is a graph over a longer time span to provide an idea of the accuracy of the forecasting algorithm:


Alternate methods

For the performance counter discussed in this post, the neural network setup was very simple: the recent history of the counter is used to forecast over a short horizon. But this is not the only setup one might want, so let me list a few more.

1. Include the time-of-day in the input
2. Include the day of the week

Both of these could be useful if the behavior of the counter follows daily or weekly cycles. (Of course one could also just train different models for the different periods, but that would involve figuring out the periods beforehand. Let the neural network figure it out!)
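As a sketch, calendar features might be appended to each history window like this (the sin/cos encoding of time-of-day is an illustrative choice that keeps 23:59 adjacent to 00:00):

```python
import numpy as np
from datetime import datetime

def add_calendar_features(window, ts):
    """Append time-of-day (as sin/cos) and day-of-week to a history window."""
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    angle = 2 * np.pi * seconds / 86400
    return np.concatenate([window, [np.sin(angle), np.cos(angle), ts.weekday()]])

features = add_calendar_features(np.zeros(12), datetime(2015, 2, 3, 9, 0))
print(features[-3:])  # time-of-day encoding plus weekday (1 = Tuesday)
```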

3. Include the behavior of related performance counters

Some obvious candidates for related counters might be a) other counters on the same device, b) counters on an upstream device, and c) business activity metrics from the application being served. (Here's a question: how can you automate the task of finding the right related counters?)
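One simple way to attack that question, offered here as a sketch rather than as our actual approach: rank candidate counters by how strongly they correlate with the target counter over a training period.

```python
import numpy as np

def rank_related_counters(target, candidates):
    """Rank candidate counters by absolute Pearson correlation with the target."""
    scores = {name: abs(np.corrcoef(target, series)[0, 1])
              for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Synthetic example: one related candidate, one unrelated one.
rng = np.random.default_rng(2)
disk_reads = rng.normal(size=500)
candidates = {
    "user_requests": disk_reads + rng.normal(0, 0.3, 500),  # related
    "cpu_other_host": rng.normal(size=500),                 # unrelated
}
print(rank_related_counters(disk_reads, candidates))
```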

One might also want to create new performance counters, not captured by the monitoring tools. For example,

4. A count of log messages at a warning level on the device
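Such a counter could be derived directly from the device logs. A minimal sketch, assuming a hypothetical '<date> <time> <LEVEL> <message>' log format:

```python
from collections import Counter

def warning_counts_per_minute(log_lines):
    """Derive a new counter from device logs: WARNING messages per minute."""
    counts = Counter()
    for line in log_lines:
        date, time, level, _ = line.split(" ", 3)
        if level == "WARNING":
            counts[(date, time[:5])] += 1  # bucket by minute, e.g. '09:15'
    return counts

logs = [
    "2015-02-03 09:15:02 WARNING disk latency high",
    "2015-02-03 09:15:40 INFO heartbeat ok",
    "2015-02-03 09:16:11 WARNING disk latency high",
]
print(warning_counts_per_minute(logs))
```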

What's next?

Over the next few months I will deploy and fine-tune this method within our RIMS environment, modeling more performance counters and ironing out scalability issues. I will post updates on our progress on this blog, and I look forward to being able to offer this to our customers later this year!
