Making AIOps work for your digital transformation projects

Businesses are pursuing digital transformation projects at an accelerated pace to become agile, respond more quickly to customer needs, and drive sustainability. In their quest to become agile, businesses are turning to a hybrid cloud model to meet their needs.

Companies are designing applications specifically to take advantage of hybrid cloud by switching to microservices, which can be deployed and scaled horizontally on demand, with Kubernetes as the de facto platform to run them on. However, while hybrid cloud and Kubernetes-based architectures provide agility, they also add complexities that operations teams must manage, especially when deploying across multiple clouds.

For instance, many teams still manage these modern application environments with traditional monitoring tools designed to keep track of independent technology domains, such as servers, storage, network, and applications. Siloed management tools result in a fragmented approach to hybrid cloud operations with no holistic, end-to-end view, and can put the brakes on a digital transformation initiative. They also waste energy and resources, increasing the carbon footprint of the project.

A new approach to managing hybrid cloud environments is artificial intelligence for IT operations (AIOps). It applies AI to ephemeral and dynamic operations data to detect abnormalities in advance, predict potential risks, and raise alarms to warn of a disaster-in-waiting. It cuts through noise and identifies, troubleshoots, and resolves common issues within IT operations. It brings together data from diverse sources and analyses it in real time at the source. It also analyses historic and current data, using machine learning (ML) to link anomalies and observed patterns to relevant events. Finally, it initiates appropriate automation-driven action, which can yield continuous improvements and fixes.

AIOps roadmap

In this article, I describe the stages of an AIOps implementation. AIOps first provides the IT operations team with information about anomalous events. Thereafter, it provides insight into what caused these events and predicts when they will recur. Finally, it either resolves these events directly or offers suggestions for a resolution.

Based on this approach, I propose three stages in my AIOps roadmap: Visibility, Insight, and Execution. I recommend that organisations take a staged approach to AIOps, starting with the first stage, which is detecting failures, and then working their way up as they become familiar with its capabilities.

Image courtesy of Hewlett Packard Enterprise

Stage 1: Visibility

In AIOps, there are two approaches to detecting anomalies: supervised and unsupervised learning. In the supervised learning approach, the AI algorithm is trained with labelled data that marks which data points are anomalous and which are not. Popular examples include k-NN (k-nearest neighbours) and Bayesian networks.
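To make the supervised approach concrete, here is a minimal k-NN sketch using only the Python standard library. The metric names (CPU percentage, latency in milliseconds), the sample values, and the labels are all hypothetical, chosen purely for illustration; a real deployment would also scale the features so that no single metric dominates the distance.

```python
from collections import Counter
from math import dist

# Hypothetical labelled samples: (cpu_percent, latency_ms) -> label.
TRAINING = [
    ((20.0, 110.0), "normal"),
    ((25.0, 120.0), "normal"),
    ((30.0, 130.0), "normal"),
    ((35.0, 125.0), "normal"),
    ((90.0, 900.0), "anomaly"),
    ((95.0, 950.0), "anomaly"),
    ((88.0, 870.0), "anomaly"),
]


def knn_classify(point, training=TRAINING, k=3):
    """Label a point by majority vote among its k nearest labelled neighbours."""
    nearest = sorted(training, key=lambda item: dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_classify((28.0, 118.0)))  # close to the samples labelled "normal"
print(knn_classify((92.0, 910.0)))  # close to the samples labelled "anomaly"
```

The quality of the result depends entirely on the labelled history, which is why this approach needs manual effort up front.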

In the unsupervised learning method, the algorithm detects anomalies by forming clusters of data points and selecting those that fall outside the clusters because they differ significantly from the rest. It assumes that most data points represent IT operations running smoothly and that the rare events that differ significantly are anomalies. Examples of unsupervised AI algorithms are autoencoders and principal component analysis (PCA).
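The core assumption here, that most points are normal and the rare outliers are anomalies, can be sketched without any labels. The example below does not use an autoencoder or PCA; it substitutes a much simpler stand-in, the modified z-score based on the median absolute deviation (MAD), applied to a single hypothetical latency metric. The sample values and the conventional 3.5 threshold are assumptions for illustration.

```python
from statistics import median


def mad_anomalies(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    values = list(values)
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        return []  # no spread in the data, so nothing can be ranked as unusual
    return [
        i
        for i, v in enumerate(values)
        if 0.6745 * abs(v - centre) / mad > threshold
    ]


# Hypothetical response times in milliseconds; no labels are supplied.
latencies_ms = [120, 118, 125, 122, 119, 121, 640, 123, 117, 124]
print(mad_anomalies(latencies_ms))  # index of the reading far from the bulk
```

No training step is needed, which is what makes this family of methods quick to deploy, at the cost of flagging anything unusual whether or not it is actually a problem.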

Between the two approaches, an IT practitioner would consider supervised AI algorithms to be comparatively more accurate because they incorporate prior information from manual training. However, unsupervised AI algorithms can be implemented faster because they do not require manual training.

Stage 2: Insight

The second stage illustrates the capabilities of AI in providing insight to the IT operators. AI combs through data, removing noise and reducing the number of events that the IT operator must respond to.

IT event data can be temporal (e.g., timestamps) or spatial (e.g., resource dependencies). AI recognises patterns within this data and groups data items into clusters based on its inference. It achieves this by applying unsupervised ML algorithms, such as Gaussian mixture models or k-means clustering, to the spatial-temporal data.

These patterns are used to identify the root causes of events by correlating events to specific patterns. AI then uses this correlation, together with a priori information, to predict incidents, saving IT teams the time and effort of working out which alerts are important and what caused them.
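As a rough illustration of grouping events on spatial-temporal data, the sketch below runs plain k-means (Lloyd's algorithm) over a handful of events. Each event is encoded as a hypothetical pair: minutes since the start of the observation window, and the number of hops from a reference service in the dependency graph. That encoding, the sample events, and the deterministic starting centroids are all assumptions made for the example.

```python
from math import dist


def k_means(points, k, iterations=20):
    """Group points into k clusters with Lloyd's algorithm (deterministic start)."""
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of points")
    ordered = sorted(points)
    step = len(ordered) // k
    centroids = [ordered[i * step] for i in range(k)]  # evenly spaced picks
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; an empty cluster
        # keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members
            else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return clusters


# Hypothetical events: (minutes since start, hops from a reference service).
EVENTS = [(1.0, 0.0), (2.0, 1.0), (3.0, 1.0), (58.0, 4.0), (60.0, 5.0), (61.0, 4.0)]

for number, group in enumerate(k_means(EVENTS, k=2), start=1):
    print(f"incident {number}: {len(group)} related events")
```

Six raw events collapse into two candidate incidents, which is the noise reduction this stage aims for. A Gaussian mixture model would follow the same pattern but assign each event a probability of belonging to each group.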

In this stage, the insight provided by AIOps simplifies the responsibilities of the IT operator in redressing issues.

Stage 3: Execution

This final stage involves the use of AIOps in either responding to alerts directly or suggesting remediations to the IT operator.

The ML algorithm observes how past tickets were dealt with by the IT operator and taps this information to handle similar tickets. Those tickets that cannot be effectively resolved by AIOps are triaged to the relevant IT operator. AI then learns from the solutions the IT operator provides to the triaged tickets and uses them to resolve similar tickets the next time around.

This is achieved by utilising natural language processing algorithms, such as hidden Markov models or neural networks, which scan through the text in the tickets and cluster similar ones. Thereafter, AI identifies the solutions input by operators for related clusters.
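The learn-from-past-tickets loop can be sketched in a few lines. The example below does not use a hidden Markov model or a neural network; it swaps in a deliberately simple technique, bag-of-words cosine similarity, to find the most similar resolved ticket and reuse its fix. The ticket history, the fixes, and the 0.3 similarity cut-off are hypothetical.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical history of resolved tickets: (ticket text, operator's fix).
HISTORY = [
    ("disk usage above 90 percent on db server", "expand volume and purge old logs"),
    ("pod crash loop after deployment", "roll back to previous image"),
    ("ssl certificate expired on gateway", "renew certificate and reload gateway"),
]


def vectorise(text):
    """Turn ticket text into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def suggest_fix(ticket, history=HISTORY, min_similarity=0.3):
    """Return the fix from the most similar past ticket, or None to triage."""
    query = vectorise(ticket)
    score, fix = max((cosine(query, vectorise(text)), fix) for text, fix in history)
    return fix if score >= min_similarity else None


print(suggest_fix("db server disk usage is above 95 percent"))
print(suggest_fix("printer out of toner"))  # no close match: goes to an operator
```

When `suggest_fix` returns `None`, the ticket is triaged to an operator; appending the operator's eventual fix to the history is what lets similar tickets be handled automatically the next time.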

The future of businesses with AIOps

AIOps has the potential to enormously impact enterprises’ digital transformations. However, while it aims to automate and align priorities based on business impact, its intention is also to put people in control of an otherwise unmanageable onslaught of data by accelerating the understanding and remediation of problems.

Gartner has predicted that the success of a digital business will depend in large part on an agile, high-performing IT workforce. Such employees will be known as “versatilists”, who combine deep skills, broad experience, and strong business networking.

As know-how and algorithms are refined over time, improved predictive capabilities can be implemented, with more context-infused AIOps. The future of businesses across the world is about adopting AI and ML to untangle the growing and unwieldy mess of data. And, although AIOps has covered significant ground, it still has a long journey ahead.