Wolf! Wolf!

POSTED BY: Data Scientist
Friday, May 8, 2015

When managing a large IT environment, the first question which an operations team wants to answer is:

Is the environment behaving normally?

Phrased in this manner, the question is much too broad:

  • What does normally mean? Is normalcy to be based on historical behavior? Or is normalcy based on the current load on the system? Perhaps it should also consider the load on connected systems?

  • What does environment mean? Should it include servers, software, and network devices? Should it include the network traffic? What about third-party web services invoked via the Internet?

A common approach is to break the environment down into the set of components being managed (on-premise/cloud-based), and ask:

Is each component behaving normally?

The question is still broad, though now more tractable; note, however, that the cost of this tractability is that we have discarded the forest for the trees.

Trees in the Forest

Having acknowledged this shortcoming, let us drill down one more level and represent a component through a collection of smaller, measurable metrics: the Key Performance Indicators for that component. These Key Performance Indicators, or KPIs, are straightforward to measure, and the tools which do this are called monitoring tools. These monitoring tools typically run on dedicated servers, polling the target component remotely when possible and utilizing a locally installed agent when not.

Here are some examples of KPIs commonly measured on different types of components:

Component    KPIs
Disk         Reads per Second, Writes per Second
CPU          Load, % Usage
Database     Queries per Second, Average Response Time
Router       Inbound/Outbound Traffic Velocity
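
For concreteness, here is a minimal sketch of how an agent might sample a few KPIs of this kind locally, using Python's psutil library. The metric names, the polling interval, and the printed output are illustrative assumptions, not any particular monitoring tool's behavior:

    import time
    import psutil

    def sample_kpis():
        # CPU KPIs: percentage usage over a 1-second window, 1-minute load average
        cpu_pct = psutil.cpu_percent(interval=1)
        load_1m = psutil.getloadavg()[0]
        # Disk KPIs: cumulative read/write operations since boot
        # (per-second rates would be derived from successive samples)
        disk = psutil.disk_io_counters()
        return {
            "cpu.usage_pct": cpu_pct,
            "cpu.load_1m": load_1m,
            "disk.read_count": disk.read_count,
            "disk.write_count": disk.write_count,
        }

    # Poll a few times, once a minute, as a monitoring agent might
    for _ in range(3):
        print(sample_kpis())
        time.sleep(60)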


Since a single component could provide between 10 and 100 KPIs, it is not feasible for human engineers to track them individually. But there is typically no need to do this; for the most part, an engineer only needs to know:

Is each KPI behaving normally?

Discrete vs. Continuous KPIs

KPIs generally fall into two categories based on the values they take: discrete and continuous. A discrete KPI assumes values like Up vs. Down, or Connected vs. Disconnected, and these values can usually be interpreted as normal vs. abnormal. So for discrete KPIs, normal behavior is straightforward to define.

A continuous KPI, by contrast, assumes numerical values: ranges of real numbers. In these cases an operations team needs to specify where normal behavior lies, and they do this by defining thresholds around it.

A threshold provides a bound on normal behavior; if you are monitoring the percentage of storage space in use on a hard disk, you might set the upper threshold to 80%. This means that if the percentage of used space exceeds 80%, the KPI has entered the realm of abnormal behavior. So normalcy for continuous KPIs can be established through the definition of thresholds.
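
As a minimal sketch of such a rule (the 80% figure comes from the example above; the function name and the use of the psutil library are my own assumptions):

    import psutil

    DISK_USED_PCT_THRESHOLD = 80.0  # upper bound on normal behavior

    def disk_usage_is_normal(path="/"):
        used_pct = psutil.disk_usage(path).percent
        # Exceeding the threshold puts the KPI in the realm of abnormal behavior
        return used_pct <= DISK_USED_PCT_THRESHOLD, used_pct

    normal, used_pct = disk_usage_is_normal()
    print(f"{used_pct:.1f}% used -> {'normal' if normal else 'ABNORMAL'}")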

Monitoring Tools and False Alarms

KPI thresholds can be configured in monitoring tools, along with the action that should be taken when a threshold breach occurs. These actions are typically very simple; examples include sending an email, placing an alert on a dashboard, or opening a support ticket.
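
To make the idea concrete, here is a rough sketch of a couple of threshold rules wired to one such action, sending an email. The threshold values, the SMTP host, the addresses, and the notify_operations/evaluate helpers are all hypothetical; real monitoring tools express this through their own rule configuration rather than hand-written code.

    import smtplib
    from email.message import EmailMessage

    THRESHOLDS = {"cpu.usage_pct": 90.0, "disk.used_pct": 80.0}

    def notify_operations(kpi, value, threshold):
        # The action triggered by a breach: here, a simple email alert
        msg = EmailMessage()
        msg["Subject"] = f"Threshold breach: {kpi} = {value:.1f} (limit {threshold})"
        msg["From"] = "monitor@example.com"
        msg["To"] = "ops@example.com"
        msg.set_content(f"{kpi} exceeded its threshold of {threshold}.")
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

    def evaluate(readings):
        # readings: dict of KPI name -> latest measured value
        for kpi, value in readings.items():
            threshold = THRESHOLDS.get(kpi)
            if threshold is not None and value > threshold:
                notify_operations(kpi, value, threshold)

    # Example (assumes an SMTP server is listening on localhost)
    evaluate({"cpu.usage_pct": 93.5, "disk.used_pct": 41.0})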

Now any competent operations team will identify important KPIs, set thresholds on their values, and define the actions to be triggered by threshold breaches. However, when setting these thresholds, they encounter two opposing factors.

Shailesh is an operations manager at an MSP (a managed services provider). When a new server gets commissioned in one of his client's IT environments, Shailesh is asked to set thresholds on its KPIs. This machine is used by the client's HR department to host an employee-facing application, and isn't considered mission-critical. Shailesh sets the threshold for CPU Usage at 90%.

One day, this client's HR department decided to run a company-wide employee satisfaction survey. As more and more employees filled in their feedback, the load on this machine got heavier and heavier. Shailesh's operations team was having a busy day, and by the time the 90% threshold was breached, each team member already had a few tickets in their queue, and nobody could attend to the new alert.

The machine crashed 10 minutes later. This infuriated the client's employees, the frustration showed up in their survey responses, and the entire survey had to be redone. The client was annoyed; they complained, and Shailesh was fired.

The new manager who took his place decided to adjust the threshold down to 70%, where it is breached several times a day. To deal with these false alarms, she has added three more members to her team. So much for automation.

Let's summarize the lessons of this story:

  • If the threshold is set too close to the range of abnormal behavior, operations may be notified of problems too late
  • If the threshold is set too far from that abnormal range, they will be swamped with false alarms (the toy simulation below makes this tradeoff concrete)
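
The simulation assumes a KPI that normally hovers around 60% with a little noise and occasional harmless spikes, and compares how often a tight 70% threshold and a loose 90% threshold raise alerts over one simulated day. Every number in it is invented purely for illustration:

    import random

    random.seed(42)

    # One day of per-minute CPU-usage readings: baseline ~60% with noise
    # and occasional short, harmless spikes
    readings = []
    for _ in range(24 * 60):
        value = random.gauss(60, 5)
        if random.random() < 0.02:  # roughly 2% of minutes see a spike
            value += random.uniform(10, 25)
        readings.append(min(value, 100.0))

    for threshold in (70.0, 90.0):
        alerts = sum(1 for v in readings if v > threshold)
        print(f"threshold {threshold:.0f}%: {alerts} alerts in 24 hours")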

Given this choice, an operations team should err on the side of caution and choose the lesser of two evils:

False Alarms. Lots of Them.

In the next post we will talk about some ways to solve this problem.

Rohit Chatterjee received his Ph.D. in Number Theory from the University of Wisconsin in 2005, after which he entered the world of High-Frequency Trading, building pricing models and big-data pipelines for financial securities. In his current role with Microland he develops statistical and machine-learning models on large volumes of IT data in order to deliver operations analytics which can be leveraged using automation.

Microland is a leading Hybrid IT Infrastructure services provider and a trusted partner to enterprises in their IT-as-a-Service journey. Microland combines deep IT Infrastructure expertise, process excellence and a passion for customer satisfaction to deliver measurable business value.
