Machine Learning to the Rescue
In the last post, I talked about the problem of false alarms in infrastructure monitoring.
The key to reducing false alarms is to set thresholds correctly, which brings us back to our ongoing question: what does "correctly" mean? What is normal behavior for a KPI?
To appreciate the difficulty with this question, consider that a counter's normal range depends heavily on context: a CPU level that is perfectly healthy during a nightly batch window might signal real trouble in the middle of a quiet afternoon.
We could therefore replace the fixed, static thresholds for each KPI with thresholds that change with context: with the time of day and the day of the week, with business activity, or with activity on connected components.
This idea sounds good, but implementing it naively would be a nightmare. There are simply too many KPIs to consider, each with its own behavioral patterns. One would need a small army of engineers just to define and keep up to date all the different thresholds, and a knowledge-management system to preserve that knowledge across people transitions.
A better strategy is to use a computer to learn the behavior of each KPI.
At Microland, we use neural networks to attack this learning problem. For our purposes, a neural network provides a way to encode an association (a function) between inputs and outputs. More importantly, a neural network can infer this function from a training set of example inputs and outputs.
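To make this concrete, here is a minimal sketch (not our production code) of that idea: a tiny one-hidden-layer network, written directly in NumPy, inferring the function y = sin(x) from a training set of sampled input/output pairs. The toy target function, network size, and learning rate are all illustrative choices; in our setting the inputs would be recent KPI values instead of x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: learn y = sin(x) from sampled pairs. This is the
# "association between inputs and outputs" the network must infer.
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X)

# One hidden layer (tanh) with a linear output unit.
n_hidden = 16
W1 = rng.normal(0.0, 0.5, (1, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.5, (n_hidden, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, h @ W2 + b2

_, pred = forward(X)
mse_before = float(((pred - y) ** 2).mean())   # error of the untrained net

# Plain full-batch gradient descent on mean squared error.
lr = 0.05
for _ in range(3000):
    h, pred = forward(X)
    err = pred - y
    gW2 = h.T @ err / len(X); gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)         # backprop through tanh
    gW1 = X.T @ dh / len(X); gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, pred = forward(X)
mse_after = float(((pred - y) ** 2).mean())    # much smaller than mse_before
```

The mechanics are the same at scale: swap the toy training set for windows of a counter's history, and the network learns that counter's behavior instead.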
For each KPI we train a neural network to make a short-term forecast based on recent KPI values. With these forecasts, we can:
- Suppress a false alarm if a KPI value crosses a static threshold but the forecast indicates that it will come back within range
In fact, we can do more with these forecasts; here are some low-hanging fruit:
- Provide advance warning of alarms when future KPI predictions cross thresholds
- Raise an alarm if the KPI's behavior deviates significantly from what was predicted, even if the value stays well away from the danger zone
We create a window around the forecast and call its edges the dynamic thresholds for the performance counter.
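As a rough sketch, that band might be computed like this. Basing the margin on the standard deviation of past forecast errors, and the width factor `k`, are illustrative assumptions on my part, not a description of our production logic:

```python
import numpy as np

def dynamic_thresholds(forecast, residual_std, k=2.0):
    """Band around the forecast: inside the band the counter is 'safe'.
    `residual_std` would come from forecast errors on held-out data;
    `k` controls how forgiving the band is (an illustrative choice)."""
    margin = k * residual_std
    return forecast - margin, forecast + margin

forecast = np.array([40.0, 42.0, 45.0, 43.0])   # hypothetical forecast
lower, upper = dynamic_thresholds(forecast, residual_std=1.5)

observed = np.array([41.0, 45.9, 44.0, 39.0])   # hypothetical readings
safe = (observed >= lower) & (observed <= upper)
# safe: [True, False, True, False]
```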
Neural networks have been in the news a lot lately, especially in the context of deep learning. Our application here is much simpler: we are not looking for patterns that are completely invisible to the human eye; rather, we are automating the tedious task of cataloguing patterns that an observant human might notice. Don't ride the hype cycle! There is no deep learning going on here.
Let's see how this method performed in a production environment. The following graphs were created as part of a POC recently completed here at Microland. The monitoring data is real production data, as are the incident tickets which were deemed to be false alarms. We determined that, for a particular performance counter on a single Windows server, our method would have prevented the creation of 60% of the false alarms from the month of February. For this blog post I will cherry-pick two of the better days.
In each graph, the green line indicates the static lower threshold for the performance counter; when the yellow line drops below this static threshold, an alarm is generated and a ticket is cut in our SmartCenter ITSM tool. The red and blue lines are the dynamic thresholds computed by our method; as long as the yellow line stays between the red and blue lines, we declare that the counter is safe.
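Putting the uses of the forecast together, the decision logic might be sketched as below. The function name, the ordering of the rules, and the labels are my own illustrative choices, not the actual SmartCenter integration:

```python
def alarm_decision(value, forecast, static_lower, dyn_lower, dyn_upper):
    """Classify the latest reading of a counter whose alarms fire on a
    lower threshold. `forecast` is the predicted short-term path."""
    if value < static_lower:
        # Static breach: suppress it if the forecast says the counter
        # comes back within range on its own.
        if all(f >= static_lower for f in forecast):
            return "suppress"
        return "alarm"
    if any(f < static_lower for f in forecast):
        return "advance-warning"      # predicted future breach
    if not (dyn_lower <= value <= dyn_upper):
        return "anomaly"              # off-forecast even in the safe zone
    return "ok"

# A transient dip that the forecast expects to recover from:
alarm_decision(30.0, [40.0, 42.0], 35.0, 38.0, 46.0)   # -> "suppress"
# A value inside the static limits but outside the dynamic band:
alarm_decision(50.0, [44.0, 45.0], 35.0, 38.0, 46.0)   # -> "anomaly"
```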
False Alarms on Feb 3 2015
False Alarms on Feb 11 2015
Finally, here is a graph over a longer time span to provide an idea of the accuracy of the forecasting algorithm:
For the performance counter discussed in this post, the neural network setup was very simple: the recent history of the counter is used to predict a short-term horizon. But this is not the only setup one might want, so let me list a few more.
1. Include the time-of-day in the input
2. Include the day of the week
Both of these could be useful if the behavior of the counter follows daily or weekly cycles. (Of course, one could also just train different models for the different periods, but that would involve figuring out the periods beforehand. Let the neural network figure it out!)
3. Include the behavior of related performance counters
Some obvious candidates for related counters might be a) other counters on the same device, b) counters on an upstream device, and c) business activity metrics from the application being served. (Here's a question: how can you automate the task of finding the right related counters?)
One might also want to create new performance counters, not captured by the monitoring tools. For example,
4. A count of log messages at a warning level on the device
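Here is a sketch of what an extended input vector covering items 1 through 4 might look like. The sine/cosine encoding of cyclic time is a common trick (it puts 23:00 next to 00:00); the function name, layout, and example values are all illustrative:

```python
import numpy as np

def feature_vector(history, hour, weekday, related_latest, warn_count):
    """Concatenate recent counter history with time-of-day, day-of-week,
    related counters, and a warning-level log count (all illustrative)."""
    tod = [np.sin(2 * np.pi * hour / 24), np.cos(2 * np.pi * hour / 24)]
    dow = [np.sin(2 * np.pi * weekday / 7), np.cos(2 * np.pi * weekday / 7)]
    return np.concatenate([history, tod, dow, related_latest, [warn_count]])

x = feature_vector(history=np.array([41.0, 42.5, 40.0]),
                   hour=14, weekday=2,
                   related_latest=np.array([0.7, 120.0]),  # e.g. upstream counters
                   warn_count=3)
# 3 history + 2 time-of-day + 2 day-of-week + 2 related + 1 log count = 10
```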
Over the next few months I will deploy and fine-tune this method within our RIMS environment, modeling more performance counters and ironing out scalability issues. I will post updates on our progress on this blog, and I look forward to being able to offer this to our customers later this year!