Observability in operations: How Lenovo uses Splunk

During the past decade, machine-generated data has grown tremendously, driven by the increasing number of machines in IT infrastructures and Internet of Things devices. All this data holds considerable insight for businesses, and Splunk – a technology firm based in San Francisco – was established in 2003 to make sense of it all.

The Hong Kong arm of technology giant Lenovo recently decided to utilise Splunk’s data platform to manage their data growth and improve their operations. Lenovo’s DevOps team, in particular, uses big data to detect and react to issues in real time, with centralised monitoring and analytics.

For reference, Lenovo operates in 180 markets worldwide and receives continuous streams of data from infrastructure logs, security software logs, and application logs. Amassing this much data necessitates a solution for proactive monitoring and intelligent analytics to allow their DevOps team to promptly identify and respond to security incidents.

Before using Splunk’s platform, Lenovo’s security engineers had to retrieve and correlate information from various system logs, then integrate and present the results in a visual format. This process made troubleshooting slow and complicated, demanding hours of engineers’ time. If there was a virus infection, for example, engineers were forced to sift through numerous disparate terminal security platforms for relevant details before having to manually correlate all the data.

In search of a scalable platform for log management and security analytics, Lenovo evaluated the efficiency and cost-effectiveness of Splunk’s platform and ultimately selected it for its stability, performance, and ability to simplify system development. Currently, Splunk is Lenovo’s IT operation data platform and one of the key components in its AIOps ecosystem.

The platform enables Lenovo to integrate their IT service management systems, business application systems, infrastructure platforms, and user behaviour data, providing their DevOps team with a real-time intelligent dashboard and supporting reports that are said to improve operational management efficiency.

Operational complexity

The need for the Splunk platform can be attributed to Lenovo’s complex IT environment. Operating across the globe, Lenovo sells a range of products both in-store and online, which demands high system stability. This, however, poses significant challenges to traditional IT operations. Lenovo is supported by multiple application systems with complex architectures involving both packaged and self-developed software across a mixed cloud environment.

Lenovo’s centralised monitoring management team enables infrastructure, application, and data integration monitoring across multiple systems and hundreds of data source types, all of which generate large amounts of data to process.

The company’s e-commerce platform gives visibility into every process, enabling them to anticipate potential threats that may affect daily transactions. 

After using another monitoring platform for a few years, Lenovo decided to upgrade to a more adaptable solution that could customise observability across operations to better respond to changing consumer preferences. Since Lenovo had already been using the Splunk Data-to-Everything Platform for IT operations and security for years, turning to Splunk for observability was a natural progression.

With visibility into their end-to-end stack, Lenovo is now able to work across their data landscape to better understand how the infrastructure behaves across different services, using machine learning analytics to anticipate and address end users’ problems.

Plugging in Splunk

Lenovo’s back-end systems help the company scale up for events and holidays, which usually bring about bottlenecks as more people access their website and mobile app. On Black Friday 2020, for instance, Lenovo offered an assortment of doorbuster deals on computer products, while giving away a limited number of gaming products as incentives.

While the company had expected a surge in sales and web traffic, the spike turned out to be 300% higher than the same period in 2019. With Splunk Observability Cloud, Lenovo saw 100% uptime with zero outages or digital crises, delivering a consistent shopping experience despite the large increase in traffic.

Lenovo also benefits from having centralised, customisable analytics dashboards that collate and analyse transactions in real time. As a result, the average time it takes to recover from a system failure has now gone from 30 minutes to about 5 minutes. The decrease in mean time to resolution (MTTR) leads to faster troubleshooting and helps the productivity of DevOps teams. This ultimately lowers revenue loss for the organisation.
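MTTR is simply the average elapsed time between detecting an incident and resolving it. A minimal sketch of that calculation, using hypothetical incident timestamps (not Lenovo’s data):

```python
from datetime import datetime, timedelta

def mean_time_to_resolution(incidents):
    """Average time from detection to resolution.

    `incidents` is a list of (detected_at, resolved_at) datetime pairs.
    """
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)

# Hypothetical incidents resolved in 4, 5, and 6 minutes respectively.
incidents = [
    (datetime(2020, 11, 27, 10, 0), datetime(2020, 11, 27, 10, 4)),
    (datetime(2020, 11, 27, 12, 0), datetime(2020, 11, 27, 12, 5)),
    (datetime(2020, 11, 27, 14, 0), datetime(2020, 11, 27, 14, 6)),
]
print(mean_time_to_resolution(incidents))  # 0:05:00
```

Tracking this figure over time is what makes the 30-minute-to-5-minute improvement measurable.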

The Splunk user interface provides a few benefits as well. For instance, technical data is more readily understood through graphical forms like Splunk’s Service Map, which creates a pictorial topology displaying the relationships between the different services built for a given application. This improves collaboration amongst DevOps teams working on services for the same application.

Splunk also integrates alerts via Microsoft Teams, providing access to real-time processing and the analysis of multiple data sources across platforms and systems. Heatmap visualisations of Lenovo’s infrastructure and traffic-light status reporting increase visibility amongst DevOps teams, and the host monitoring function lets the team capture real-time status updates from each server.

Problems with siloed data

Dhiraj Goklani, Area Vice President of IT & DevOps, APAC, Splunk. Image courtesy of Splunk.

According to Dhiraj Goklani, Area Vice President of IT & DevOps, APAC at Splunk, tools used by IT and DevOps teams to monitor and manage applications and infrastructure are typically disconnected, often separated into two or three different platforms. “These tools don’t interact with one another, thus providing separate, and possibly contrasting data,” he said.

“With the rapid shift to cloud infrastructure, as well as the deployment of emerging technologies such as artificial intelligence (AI), Internet of Things, and machine learning, IT and DevOps teams are also wrestling with increased operational complexity. This complexity is compounded by too many existing monitoring tools that have blind spots, disparate data sources, and disjointed workflows,” observed Goklani.

He added: “It is also important to note that metrics, traces, and logs aren’t solutions, but redefinitions of the original problem – that of observability. Simply defined, observability uses systems and applications to measure the internal states of a system as inferred from the unified knowledge of metrics, traces, and logs. This functionality then translates the external outputs to provide a comprehensive understanding of systems, their workload, and their behaviour.”

As examples of the blind spots, siloed data, and disjointed workflows in current systems, Goklani noted that disconnected data systems compound inefficiencies and integration limitations, resulting in more manual processes. “When each system operates independently, multiple sets of data require monitoring, troubleshooting, and response, most of which may overlap with one another and result in increased hours of manual data management,” he said.

Goklani observed that siloed data systems create enterprise-wide blind spots as teams access separate software systems to obtain and make sense of data. 

“Multiple systems that do not share data mean data duplication, wasted time, and the possibility of error. Multiple software systems also translate to weak security measures, and a single platform protected by the cloud-based security measures is more efficient in safeguarding business data. When data is duplicated across multiple software systems with varying levels of data protections, businesses risk exposure to hackers and the possibility of cyberattacks,” he said.

Tools for enterprises

Goklani said that Splunk Enterprise Security – an analytics-driven security information and event management (SIEM) product – solves the above issues by allowing enterprises to monitor, troubleshoot, respond to changes, and handle performance bottlenecks at any scale.

When it comes to insights that enterprises can draw from Splunk Observability Cloud, Goklani had this to say: “It provides access to real-time analytics and visualisations in one place without sampling. Splunk Observability Cloud is an analytics-powered, enterprise-grade solution designed to give customers visibility into their infrastructure, applications, and users. It provides IT and DevOps teams a unified observability platform for monitoring, troubleshooting, and response.”

“The platform consumes and manages OpenTelemetry data at scale – an observability framework comprising vendor-neutral APIs, software development kits, and tools to collect telemetry data from cloud-native software and applications. The real-time architecture enables context-rich investigations, helping IT and DevOps teams to resolve issues and decrease MTTR,” he added.
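The trace data such frameworks collect has a common shape: spans with IDs, parent links, and timings, grouped under a trace. A toy illustration of that shape in plain Python – this is deliberately not the OpenTelemetry SDK, just a sketch of the data model it standardises:

```python
import time
import uuid
from contextlib import contextmanager

# Illustrative only: a toy span recorder showing the kind of trace data
# (trace/span IDs, parent links, durations) that observability frameworks
# like OpenTelemetry collect.
SPANS = []

@contextmanager
def span(name, trace_id=None, parent_id=None):
    record = {
        "name": name,
        "trace_id": trace_id or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex,
        "parent_id": parent_id,
        "start": time.monotonic(),
    }
    try:
        yield record
    finally:
        record["duration_s"] = time.monotonic() - record["start"]
        SPANS.append(record)

with span("handle_request") as root:
    with span("query_db", trace_id=root["trace_id"], parent_id=root["span_id"]):
        time.sleep(0.01)  # simulated work

print([s["name"] for s in SPANS])  # ['query_db', 'handle_request']
```

Because every span carries its trace ID and parent link, a backend can reassemble the full request path across services – which is what makes the “context-rich investigations” Goklani describes possible.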

Splunk Application Performance Monitoring, on the other hand, is an AI-powered tool that lets enterprises ingest, analyse, and store trace data for later use with no sampling, which is said to shorten troubleshooting time. “Combined with infinite cardinality, enterprises can examine errors and latency across all the tags of any given service,” he said.

Goklani explained that the tool’s AI-driven analytics automate the correlation between application performance, critical business KPIs, infrastructure, and end-user experience. He added that it provides meaningful alerts – not a barrage of notifications – and highlights the root causes of problems.

As for advice for IT and DevOps teams dealing with operational complexity, Goklani suggested that they invest in observability solutions to supplement their ongoing activities and ensure cloud environments run smoothly.

“COVID-19 has been a catalyst, greatly accelerating digital transformation, including cloud migration, app modernisation and the development of net new direct-to-consumer cloud-native applications. This translates to the growing complexity of applications and cloud infrastructure, in addition to increased data flows and security concerns, combined with teams still working remotely – these are creating more challenges than ever, making observability essential,” he concluded.