Demystifying ITOA - I

POSTED BY : Data Scientist
Wednesday, June 17, 2015

What is ITOA?

ITOA stands for IT Operations Analytics. It aims to assist IT operations teams by applying analytic methods to operations data for:

  • Incident Prediction
  • Root Cause Analysis
  • Performance Optimization
  • Capacity Optimization

Here is a definition of ITOA from the BMC website:

"IT Operations Analytics (ITOA) automates the process of collecting, organizing, and identifying patterns in highly distributed, diverse and fast-changing service and application data to identify problems faster and improve IT system performance."

There is a wealth of information in this definition, so let us take it apart.

What is the data under consideration?

ITOA considers service and application data.

I wondered why operations data is not included in this list (or in any ITOA definition), and I found a partial explanation, which I come to later in this post. It is not for lack of value: there is a wealth of information in ticket data, knowledge bases, the CMDB, RCA reports etc., and some vendors like Evolven take this type of information into account.

 

What is the nature of this data?

It is highly distributed (from multiple sources). Typically this includes application logs, infrastructure metrics, application performance metrics and network traffic. However, not every ITOA tool works on every one of these data types. For example, Netuitive works on performance metrics, Splunk on logs and ExtraHop on network data.

It is diverse (not all of the same type, format or structure). For example, historical CPU consumption on a server can be represented as a time series of numbers, but log data cannot. In terms of implementation, this implies that you will need several data stores.

It is rapidly changing. ITOA really emphasizes its real-time aspect - the point is to collect data as it is generated and act on it immediately.

 

What are the goals of ITOA?

  • To identify problems - ideally problems are predicted before they occur, but when they do occur ITOA helps in identification and triaging
  • To improve performance - this includes uptime, speed, and capacity optimization

Another goal that MSPs like Microland are interested in is manpower reduction. The cost savings we achieve using ITOA are passed on to our customers.

 

How does ITOA achieve this?

ITOA data needs to be collected and analyzed. Collection means both ingest and storage, and this should be automated since it all happens so fast. Importantly, the analysis (the pattern identification) also needs to be automated. This can mean many things, but typically some kind of machine learning eventually needs to be brought in.

My last two posts talk about an ITOA POC conducted here at Microland, and my next posts will contain more examples.

 

So, is it APM?

This is a common question. ITOA is a little different from APM.

Let us start with a quick revision of the IT Management landscape and see where it fits in. This will also give us a chance to recall how IT teams solve problems and what is still lacking.

IT management has two broad areas, systems management and services management. Systems management is concerned with how the various IT components perform. Services management is concerned with how the IT team ("IT operations") delivers their service to the business. Operations teams use methods and techniques from both these viewpoints to ensure that their IT ecosystem is under control.

IT Systems Management takes care of the IT systems. These systems have a lifecycle, involving

  • Provisioning
  • Patches and upgrades
  • De-provisioning

There are tools and processes for managing these lifecycles. While the systems are live, operations also needs to carry out several kinds of monitoring, each with its own tools and processes:

  • Server monitoring
  • Network monitoring
  • User activity monitoring
  • Capacity monitoring

Systems management thus involves component lifecycles and their monitoring.

Application Performance Management (APM) - Often the big picture is lost when focus is placed on the performance of individual components (the big picture being the performance of the IT system from the point of view of the people who use it). APM seeks to remediate this by making the End User Experience (EUE) central. APM systems measure metrics in two areas - the EUE and the underlying resource consumption in the system. The relationships between these sets of measures are then determined so that the second set can be used as a proxy for the first in the live environment.

For example, IT Operations has learnt that a certain sales dashboard is often slow. They measure the Average Response Time for that dashboard and find it correlates with the number of open network connections on the Oracle DB server. With this knowledge, they decide to continuously monitor this number (of network connections) and keep it within limits.
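To make the correlation step concrete, here is a minimal sketch in Python. The sample numbers are invented for illustration, and the `pearson` helper is a hypothetical name; a real APM tool would work on far more samples and would not stop at a single coefficient.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Invented samples, taken at the same instants (illustrative only).
open_connections = [120, 150, 180, 240, 300, 360]   # on the DB server
avg_response_ms = [410, 450, 520, 640, 790, 900]    # for the sales dashboard

r = pearson(open_connections, avg_response_ms)
print(f"correlation = {r:.3f}")
```

A coefficient close to 1 is what would justify using the connection count as a proxy for the response time. Correlation alone does not show that one causes the other, so the proxy still needs to be validated in the live environment.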

APM does not rely on human feedback to get a handle on the EUE (there are no questionnaires distributed by the IT department). All inputs to APM are from monitoring data that is analyzed using statistical techniques (like correlation). ITOA builds on APM by extending these techniques, but with more general goals than just the EUE.

Business Transaction Management (BTM) is similar to APM in that a relationship is sought between EUE and resource consumption. But instead of measuring EUE on individual pages in a web application, BTM seeks a view of an entire business transaction and monitors the EUE at this level. The methods are necessarily different because one needs a view of a transaction, a view that is not readily available and needs to be assembled from monitoring data using Network Packet Analysis and Log Analysis techniques (these are powerful techniques that ITOA uses as well).
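As a rough illustration of what "assembling" a transaction view from logs can look like, here is a small Python sketch. It assumes, purely for the example, that every log record already carries a transaction ID; the record layout, the component names and the `assemble_transactions` function are all hypothetical. In practice, establishing that ID across components is the hard part.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log records: (timestamp, component, transaction id).
LOG_RECORDS = [
    ("2015-06-17T10:00:00", "web", "txn-1"),
    ("2015-06-17T10:00:01", "app", "txn-1"),
    ("2015-06-17T10:00:02", "web", "txn-2"),
    ("2015-06-17T10:00:04", "db", "txn-1"),
    ("2015-06-17T10:00:03", "app", "txn-2"),
]

def assemble_transactions(records):
    """Group records by transaction id; return the path and duration of each."""
    grouped = defaultdict(list)
    for ts, component, txn_id in records:
        grouped[txn_id].append((datetime.fromisoformat(ts), component))
    summary = {}
    for txn_id, events in grouped.items():
        events.sort()  # order the events of one transaction by timestamp
        start, end = events[0][0], events[-1][0]
        summary[txn_id] = {
            "path": [component for _, component in events],
            "duration_s": (end - start).total_seconds(),
        }
    return summary

transactions = assemble_transactions(LOG_RECORDS)
for txn_id, info in sorted(transactions.items()):
    print(txn_id, info["path"], info["duration_s"])
```

The end-to-end duration per transaction is the kind of measure that BTM would then relate to resource consumption.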

That was a brief overview of IT Systems Management. Now let us look at IT Service Management (ITSM).

ITSM seeks to monitor and continuously improve IT service delivery in a process-driven manner. Processes are defined for how to respond to IT problems, including logging the initial complaint, defining escalation rules, responding to frequently occurring problems, tracking assets and maintaining knowledge repositories. ITSM (and the associated framework of practices known as ITIL) has been very successful - it has gained widespread adoption with over 1.5 million certified professionals around the world. ITOA and ITSM can complement each other, with ITSM insights providing the starting points for ITOA investigations.

Business Service Management (BSM) builds on ITSM but takes a customer-centric view by shifting the focus to the business. While ITSM is by IT engineers and for IT engineers, BSM takes a business viewpoint for all questions, analyses and decisions. Business needs (rather than the needs of the IT organization) drive service improvements. Service degradation is reported in terms of business impact rather than in terms of IT components. Thus a BSM initiative requires different tools and processes to capture metrics of business criticality. The BTM techniques mentioned above are also used by ITOA, and so ITOA should be able to provide inputs for BSM. BSM adoption has been slow and only time will tell if ITOA will help it along.

If we look back at the BMC definition for ITOA with this background, it is clear that ITOA fits into the systems side of this story. This is probably why ITOA typically does not inspect service data like incident or change tickets.

How is ITOA then different from APM and BTM?

ITOA is enabled by two distinct but related developments - Big Data and Machine Learning.

Big Data provides the engineering foundation for the analyses we want to do. With the big data tools available today, much more analysis is possible than was possible just a few years ago.

"Operations team typically fail to leverage the data generated by their management tools because they don't have dedicated data warehouses to store such data; they don't necessarily have the skills to perform the necessary analysis; and/or they simply don't have the time to devote to forensic analysis or predictive monitoring." - Mark Settle, IHS (formerly with BMC)

We now have the ability to ingest monitoring data at high speeds using specialized lightweight protocols like AMQP and MQTT, and to store the data in highly scalable systems like Cassandra, MongoDB, or Elasticsearch. More importantly, we can perform live analyses on the streaming data, which is what makes the real-time requirements of ITOA achievable.
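To give a flavor of what "live analysis on streaming data" can mean, here is a minimal Python sketch that flags unusual values as they arrive, without storing the history. It uses Welford's online algorithm for the running mean and variance. The message transport is left out entirely: a plain list stands in for the stream, and the class name, the threshold of 3 standard deviations and the warm-up of 5 samples are assumptions chosen for the example.

```python
from math import sqrt

class StreamingAnomalyDetector:
    """Flags values far from the running mean, using Welford's online algorithm."""

    def __init__(self, z_threshold=3.0, warmup=5):
        self.z_threshold = z_threshold
        self.warmup = warmup   # samples to observe before flagging anything
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0          # running sum of squared deviations from the mean

    def observe(self, value):
        """Return True if value looks anomalous, then fold it into the statistics."""
        anomalous = False
        if self.count >= self.warmup:
            std = sqrt(self.m2 / (self.count - 1))
            if std > 0 and abs(value - self.mean) / std > self.z_threshold:
                anomalous = True
        # Welford update: constant memory, one pass over the data.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        return anomalous

detector = StreamingAnomalyDetector()
cpu_stream = [41, 43, 40, 42, 44, 41, 43, 95, 42]  # invented CPU % samples
flags = [detector.observe(v) for v in cpu_stream]
print(flags)
```

Note that this sketch folds every value, including the anomalous one, back into the running statistics. That keeps the code short but lets a spike inflate the variance for a while; a production system would more likely use a sliding window or exclude flagged values.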

Machine Learning is "the science of getting computers to act without being explicitly programmed" (from Stanford's Machine Learning course on Coursera). In the context of ITOA, this means that we should not be trying to define the behavior we are interested in ourselves; the computers should learn it from the data. Under the hood this means that the behavior of certain programs is defined by parameter configurations ("models") that are arrived at through statistical analyses conducted by other programs.

An Approach to Incident Prediction

A hypothetical ITOA system may have a program called the "Incident Predictor". This program has been written to notify the operations team of an impending incident when certain "warning behavior" is observed. This warning behavior may be a combination of conditions such as the following:

  1. "The available memory on the server _________ drops below __% and stays below that level for ___ minutes"

AND

  2. "The logs for the application ______ contain more than ___ WARNING messages in the last 20 minutes"

This logic is pretty straightforward; all that is left is to fill in the blanks. This is done by training a model on historical data. Once the model is ready, the Incident Predictor is ready to roll: the ITOA system continuously feeds it the server's memory metrics and the relevant application logs, and the Incident Predictor triggers an alarm if those conditions are met.
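The two-condition template above could be evaluated roughly as follows. This is only a sketch: the `IncidentRule` fields stand for the blanks, the values plugged in at the bottom are invented rather than learned, and the data layout (one memory sample per minute, log events as minute and level pairs) is an assumption made to keep the example short.

```python
from dataclasses import dataclass

@dataclass
class IncidentRule:
    """The filled-in blanks; in practice these would come from a trained model."""
    memory_pct_threshold: float   # "drops below __%"
    memory_minutes: int           # "stays below that level for ___ minutes"
    warning_count_threshold: int  # "more than ___ WARNING messages"
    log_window_minutes: int = 20  # "in the last 20 minutes"

def memory_condition(rule, memory_pct_per_minute):
    """True if the most recent `memory_minutes` samples are all below the threshold."""
    recent = memory_pct_per_minute[-rule.memory_minutes:]
    return (len(recent) == rule.memory_minutes
            and all(pct < rule.memory_pct_threshold for pct in recent))

def log_condition(rule, log_events, now_minute):
    """True if more WARNING messages than the threshold fall inside the window."""
    window_start = now_minute - rule.log_window_minutes
    warnings = sum(1 for minute, level in log_events
                   if level == "WARNING" and window_start < minute <= now_minute)
    return warnings > rule.warning_count_threshold

def should_alert(rule, memory_pct_per_minute, log_events, now_minute):
    """Both conditions must hold (the AND in the template)."""
    return (memory_condition(rule, memory_pct_per_minute)
            and log_condition(rule, log_events, now_minute))

# Invented values for the blanks and invented monitoring data.
rule = IncidentRule(memory_pct_threshold=10.0, memory_minutes=5,
                    warning_count_threshold=3)
memory = [35, 22, 9, 8, 7, 6, 5]   # % available memory, one sample per minute
logs = [(41, "INFO"), (45, "WARNING"), (50, "WARNING"),
        (55, "WARNING"), (58, "WARNING"), (59, "ERROR")]  # (minute, level)

alert = should_alert(rule, memory, logs, now_minute=60)
print(alert)
```

Keeping the blanks in a separate `IncidentRule` object mirrors the split described in the text: the evaluation logic is fixed, and only the parameters change when the model is retrained.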

Getting More Complex

Once this picture is clear, we can imagine how it generalizes. We may train models to fill in the following blanks

  1. "The ____________ on the _________ drops below/above __% and stays below/above that level for ___ minutes"

AND/OR

  2. "The logs for ____________ ______ contain more than/less than ___ ____________ messages in the last _________"

But why stop at two conditions? Maybe there is a type of incident that can be forecast accurately by considering 12 conditions simultaneously. This is why ITOA calls for generic machine learning techniques rather than templates to fill in.
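As a small step in that direction, here is a sketch of how a program could fill in the blanks by itself from labeled history. It trains a "decision stump", one of the simplest machine learning models: it searches over every metric, both directions (below/above) and every candidate threshold, and keeps the single rule that makes the fewest mistakes on the historical data. The feature names and the six historical rows are invented, and this is a stand-in for the idea rather than what any particular ITOA product does.

```python
def predict(rule, row):
    """Apply a (feature, direction, threshold) rule to one observation."""
    feature, direction, threshold = rule
    if direction == "below":
        return row[feature] < threshold
    return row[feature] > threshold

def train_stump(rows, labels):
    """Pick the (feature, direction, threshold) rule with the fewest training errors.

    rows   -- list of dicts mapping feature name to numeric value
    labels -- list of bools, True where an incident followed
    """
    best = None
    for feature in rows[0]:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds: midpoints between consecutive observed values.
        candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
        for threshold in candidates:
            for direction in ("below", "above"):
                rule = (feature, direction, threshold)
                errors = sum(predict(rule, row) != label
                             for row, label in zip(rows, labels))
                if best is None or errors < best[0]:
                    best = (errors, rule)
    if best is None:
        raise ValueError("need at least two distinct values in some feature")
    return best[1]

# Invented history: metric snapshots and whether an incident followed.
history = [
    {"free_memory_pct": 40, "warnings_20m": 1},
    {"free_memory_pct": 35, "warnings_20m": 6},
    {"free_memory_pct": 12, "warnings_20m": 2},
    {"free_memory_pct": 8, "warnings_20m": 5},
    {"free_memory_pct": 6, "warnings_20m": 3},
    {"free_memory_pct": 30, "warnings_20m": 0},
]
incident_followed = [False, False, False, True, True, False]

learned_rule = train_stump(history, incident_followed)
print(learned_rule)
```

A stump learns only one condition. Combining many such conditions, as in the 12-condition case, is what decision trees and ensembles of trees do, and that is where general-purpose machine learning libraries take over from hand-written templates.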

I hope this has been interesting. I'm going to close with a quote to pique your interest even more:

"..at the heart of these ITOA systems are pattern matching technologies and analytics engines that can use complex event processing (CEP), machine event and log indexing and search, behavior learning engines (BLE).."

- Will Cappelli, Gartner

In my next post I will cover some more examples. In the meantime I welcome your questions and feedback via Twitter or in the comments below.

 

 

 

Rohit Chatterjee
Data Scientist
Business Strategy