Reimagining a new era of incident response as LLMs advance

As Asia-Pacific organisations continue to accelerate their digitalisation, they face great pressure to keep everything running smoothly in an increasingly complex IT environment. The stakes are arguably higher in this region than anywhere else in the world. New Relic’s 2023 Observability Forecast found that Asia-Pacific had the highest median annual outage cost by far: more than double the figure in Europe and nearly 16 times that of North America.

IT teams in the region are not only saddled with the responsibility to find and fix incidents as quickly as possible; they also need to prevent those costly incidents from occurring again. Naturally, many IT leaders in the region are watching the emergence of AI and the evolution of large language models (LLMs), and weighing their potential to change incident response as we know it.

Prevention is the north star of incident response with AI, but experience matters

Many teams are already beginning to see how AIOps capabilities such as proactive anomaly detection, incident correlation to reduce alert noise, and automated probable root-cause analysis can help minimise issues and their impact on customer experience.
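To make "proactive anomaly detection" concrete, here is a minimal sketch rather than a description of how any particular AIOps product works. It assumes a simple trailing-window z-score check; the function name, window size, threshold, and latency figures are all hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the mean of the preceding
    `window` samples by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical latency samples in milliseconds (illustrative, not real data).
latency_ms = [102, 98, 101, 99, 100, 103, 97, 250, 101, 99]

if __name__ == "__main__":
    print(find_anomalies(latency_ms))
```

A check like this flags the sudden jump before a person has to notice it on a dashboard. Production systems would typically add seasonality handling and correlate flagged signals across services to cut alert noise.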

The promise of AI in minimising IT incidents appears infinite, with some even suggesting it will eventually achieve the goal of preventing disruptions and outages altogether. However, skipping any fundamental steps in that journey or limiting the experience of IT teams working through incident responses today could prove detrimental to the advancement of LLMs.

For many IT teams, it still takes too much time to detect potential problems before they turn into incidents. Teams often work reactively, firefighting incidents while never finding time to implement processes that allow them to identify issues before they cause disruptions.

To master prevention with the support of LLMs, teams first need hands-on experience of finding and fixing incidents. This step cannot be skipped, because it is how teams learn the skills to implement mitigation strategies and take preventative measures. This experience will enrich both the human teams and the capability of LLMs to understand and rationalise extensive data sets and accomplish the varied tasks within the incident response lifecycle.

Three ways LLMs will transform incident response

The incident response lifecycle can vary from organisation to organisation, and even from team to team. Here are some possibilities within key tasks across the incident response lifecycle:

  • Research: When an incident occurs, the first step an engineer takes is to gather information and research the problem space. LLMs have a significant role to play in this process. With access to current and historical data, LLMs will be capable of analysing the incident, searching past incidents for relevant precedents, and reasoning over this data to recommend a potential path forward. With LLMs taking on the role of researcher, SRE teams will save significant manual hours.
  • Troubleshooting and diagnosis: As LLMs evolve, teams will be able to draw on the same research function using broader knowledge bases to help investigate an incident, including identifying the runbooks that apply to it. As the knowledge base extends beyond the organisation to external knowledge, AI agents will be able to perform automated root-cause analysis through iterative evaluation of hypotheses that draw on local experience and world knowledge. They will be able to mimic human cognition, perform reasoning, and take actions through dialogue with human teams to fill in any gaps from earlier stages, then assist by making suggestions. The value to engineering lies in a shorter mean-time-to-understanding of the impact and cause of incidents, while the value to the business lies in a shorter mean-time-to-resolution.
  • Incident postmortems and documentation: After an incident, it is common for engineers to collect and summarise the relevant data and produce a postmortem. An incident postmortem involves dissecting failures to gain insights into why they occurred, how they impacted operations, and, most importantly, how to prevent them in the future. This process can take weeks. Through search, summarisation, and reasoning abilities, LLMs can facilitate the initial stages of creating a post-incident review by collecting, collating, summarising, and analysing the data, then making recommendations relating to mitigation strategies. This reduces the cognitive load on engineers and saves them a significant amount of time.
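The research task described in the list above, searching past incidents and handing the findings to an LLM, can be sketched as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: the `Incident` record, the sample history, and the helper names are hypothetical, word-overlap (Jaccard) similarity stands in for the embedding search a real system would more likely use, and no actual LLM is called. The code only assembles the context an LLM would be given.

```python
import re
from dataclasses import dataclass


@dataclass
class Incident:
    incident_id: str  # hypothetical identifier
    summary: str      # free-text description of what happened
    resolution: str   # what fixed it


def tokens(text: str) -> set[str]:
    """Lowercase word tokens: a crude stand-in for a text embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of tokens the two sets have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_incidents(query: str, history: list[Incident],
                      top_k: int = 2) -> list[tuple[float, Incident]]:
    """Rank past incidents by token overlap with the new incident text."""
    q = tokens(query)
    scored = [(jaccard(q, tokens(inc.summary)), inc) for inc in history]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(query: str, matches: list[tuple[float, Incident]]) -> str:
    """Assemble the context an LLM would receive (no model is called here)."""
    lines = [f"New incident: {query}", "Similar past incidents:"]
    for score, inc in matches:
        lines.append(
            f"- {inc.incident_id} (similarity {score:.2f}): "
            f"{inc.summary}; resolved by: {inc.resolution}"
        )
    lines.append("Suggest a likely cause and a next diagnostic step.")
    return "\n".join(lines)


# Hypothetical incident history, invented purely for illustration.
HISTORY = [
    Incident("INC-1", "checkout api latency spike after database failover",
             "raised connection pool size"),
    Incident("INC-2", "login page returns 500 errors after certificate expiry",
             "renewed tls certificate"),
    Incident("INC-3", "disk full on logging cluster", "rotated logs"),
]

if __name__ == "__main__":
    new_incident = "checkout latency spike database"
    print(build_prompt(new_incident, similar_incidents(new_incident, HISTORY)))
```

The sketch also shows why documented experience matters: the retrieval step can only surface resolutions that a human team has already worked through and written down.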

As LLMs become more sophisticated, organisations and their IT teams can certainly look forward to their benefits in managing and eventually preventing incidents. The caveat is that there are no shortcuts to the process, and more importantly, there is no substitute for the lived experience of human teams.

To perform tasks based on logical reasoning effectively, LLMs need human teams with a wealth of lived and documented incident response experience to draw on. Only then will the tools produce the anticipated positive impact on incident response times, resolution times, and overall outcomes. The next chapter of incident response will be powered by greater efficiency in how organisations respond to, manage, and learn from incidents, underpinned by intelligence, automation, and human-machine collaboration.