The global e-commerce boom owes much of its success to the rise of digital payment platforms, and vice versa. As people and businesses increasingly rely on e-wallets for convenient transactions, these platforms are under pressure to keep their services running reliably at all times.
This means that a software glitch or an unprotected API could spell disaster for companies like DANA Indonesia, because customers expect their funds to be accessible 24/7.
Fortunately, the Indonesian digital payments platform has taken a proactive approach by adopting a new strategy to anticipate and address untoward incidents even before customers become aware of them.
Previously, DANA struggled with siloed observability, which prevented them from obtaining a comprehensive, end-to-end view across user interactions within their app, server-side components, and infrastructure.
Norman Sasono, the Chief Technology Officer at DANA Indonesia, noted that this hindered their ability to promptly resolve issues.
“Identifying the problem’s origin, pinpointing the affected area, and subsequently recovering from system failures or incidents took longer than expected. This had a direct impact on our MTTR (mean time to recovery) metric, which is an indicator of our system’s resilience. Ultimately, it also impacted the quality of services and the user experience (UX) for our customers,” he recalled.
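MTTR is commonly calculated as the total time spent recovering from incidents divided by the number of incidents. The sketch below illustrates that arithmetic, plus one way to express an improvement in it as a percentage. It assumes a hypothetical incident record that measures recovery from the moment of detection. Definitions vary between organisations, and this is not DANA's or Splunk's actual calculation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record: when an incident was detected and when service recovered.
    detected_at: datetime
    recovered_at: datetime


def mean_time_to_recovery(incidents: list[Incident]) -> timedelta:
    """Average recovery duration (MTTR) across a list of incidents."""
    if not incidents:
        raise ValueError("MTTR is undefined when there are no incidents")
    # Sum the individual recovery durations, starting from a zero timedelta.
    total = sum((i.recovered_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


def percent_reduction(before: timedelta, after: timedelta) -> float:
    """How much shorter 'after' is than 'before', as a percentage."""
    # timedelta / timedelta yields a float ratio.
    return (1 - after / before) * 100
```

For example, two incidents that took 30 and 90 minutes to resolve give an MTTR of 60 minutes. Cutting a two-hour recovery to 15 minutes is an 87.5% reduction.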
Seeing the whole picture
To resolve their situation, DANA quickly scouted for a technology partner capable of providing complete observability, eventually deciding to work with SaaS company Splunk.
At first, integrating Splunk with DANA’s existing infrastructure proved to be quite a challenge, noted Dhiraj Goklani, VP of Observability, APAC, Splunk.
“DANA’s application stack was not something that we typically supported out-of-the-box. However, we worked closely with the DANA team and utilised the capabilities of OpenTelemetry to create a workaround, which eventually resulted in successful instrumentation,” he explained.
Overall, implementing the Splunk Observability Cloud at DANA proceeded smoothly, Goklani added. During the proof-of-concept phase, before purchase, approximately 25-30% of the use cases were already operationalised.
By integrating its SaaS applications and their hosts with Splunk application performance monitoring (APM) and information management, the payments platform was able to put APM to work and expand its observability capabilities.
DANA also onboarded its native mobile application to Splunk’s real-user monitoring, which reportedly extended observability across its entire application ecosystem.
“With these integrations in place, our engineering team has gained invaluable insights into our applications’ performance and infrastructure,” shared DANA’s CTO Norman Sasono. “This enhanced visibility enables us to pinpoint the root causes of incidents more effectively, including specific applications or APIs that may be contributing to issues.”
Armed with this knowledge, DANA’s team can now promptly address problems, improve incident response times, and optimise overall system performance.
“These integrations have become indispensable tools for driving continuous improvement, as well as ensuring the reliability and stability of our applications and services,” the CTO continued.
Sasono highlighted the following reasons for partnering with Splunk:
- No data sampling, enabling 100% observability.
- Compliance with open standards such as OpenTelemetry.
- End-to-end observability rather than partial coverage.
No matter how many security solutions and strategies they implement, organisations cannot guarantee immunity from cybercriminals.
In DANA’s case, the company experienced a bot attack and several other security issues. However, Splunk APM played a crucial role in identifying the user agent, or source, of the attack.
“This invaluable information greatly influenced our decision-making process on the security front, enabling us to take appropriate measures such as blocking the identified threats. By leveraging the capabilities of the APM, we swiftly identified and addressed potential risks, ensuring the integrity and safety of our systems,” DANA’s Norman Sasono said.
With the help of Splunk’s monitoring tools, DANA is able to identify the root cause of any issue and resolve it within 15 minutes, according to Splunk’s Dhiraj Goklani.
“At the same time, DANA is taking proactive measures to prevent the same problems from happening in the future. Such anomalies previously took DANA hours or days to uncover. Overall, DANA’s MTTR for casual errors is now 70-90% faster since its team enabled custom index tags, allowing them to know the error code without checking the logs,” Goklani noted.
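Goklani's point about custom index tags can be illustrated with a simplified model. If every trace span carries its error code as a tag, engineers can count failures per service, operation and code directly, without searching log text. The `Span` class, the tag name `error.code`, and the service and operation names below are all hypothetical. They do not represent Splunk's or OpenTelemetry's data model or API.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Span:
    # Simplified, hypothetical stand-in for a trace span.
    service: str
    operation: str
    is_error: bool
    tags: dict[str, str] = field(default_factory=dict)


def errors_by_code(spans: list[Span], tag: str = "error.code") -> Counter:
    """Count failing spans per (service, operation, error code).

    The error code is read straight from the span's tag, so no log
    parsing is needed; spans without the tag are counted as 'unknown'.
    """
    return Counter(
        (s.service, s.operation, s.tags.get(tag, "unknown"))
        for s in spans
        if s.is_error
    )


# Usage with hypothetical spans:
spans = [
    Span("payments-api", "POST /transfer", True, {"error.code": "TIMEOUT"}),
    Span("payments-api", "POST /transfer", True, {"error.code": "TIMEOUT"}),
    Span("wallet-api", "GET /balance", True, {"error.code": "AUTH_FAILED"}),
    Span("wallet-api", "GET /balance", False),
]
counts = errors_by_code(spans)
```

Here `counts.most_common(1)` points straight at the timeout errors on the hypothetical transfer endpoint. Tagging at instrumentation time moves the work of classifying errors out of incident response.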
In addition to its security benefits, Splunk APM has also provided DANA with a deeper understanding of its applications and APIs.
“We have gained valuable insights into their performance, allowing us to identify areas that have improved compared to previous benchmarks and areas that have experienced a decline. This knowledge drives us to continuously enhance our systems, motivated by our commitment to progress and avoiding complacency,” Sasono said.
Gazing at tomorrow
Moving forward, the digital payments platform aims to increase its observability maturity and to adopt true “No Ops” and “AI Ops” approaches.
Further, DANA is banking on achieving full automation of its systems and processes while enhancing its capabilities to detect problems early.
One particular feature DANA is exploring is the implementation of “self-healing” mechanisms for incidents, glitches, or system issues.
“These abilities will improve and shorten our MTTR, and subsequently enhance the overall quality of service (QoS) for users, including availability, reliability/success rate, performance/response time, and scalability. With better QoS, users will enjoy an enhanced UX, resulting in a more seamless and frictionless experience when customers use our platform. This will drive user retention and growth, and ultimately revenue generation,” revealed DANA’s Norman Sasono.
Meanwhile, Splunk's latest State of Observability Report found that, compared with the previous year, more IT practitioners are combining observability with other monitoring practices, most commonly security.
“In response, we have further integrated observability and security in our unified platform. This includes increasing service level visibility and having more coverage for hybrid and multi-cloud environments. We’re also enhancing security and compliance measures to help developers be more productive within Splunk’s platform,” Splunk’s Dhiraj Goklani said.
Additionally, Goklani stressed that as AI and ML have become essential parts of the observability toolset, it is important to ensure that the adoption of AI/ML in observability tools remains authentic and avoids “AIOps-washing”.
“Despite the industry’s buzz surrounding AI, we have a responsibility to our customers to assess and provide analysts with the necessary context before sharing our AI plans, especially amid the current levels of uncertainty around the risks and challenges associated with adoption,” the Splunk executive said.
Finally, Splunk is developing a scalable enterprise platform powered by OpenTelemetry, a vendor-agnostic, open-source observability framework.
“OpenTelemetry will continue to work even as new technologies emerge, unlike commercial solutions that necessitate vendors to build new integrations for interoperability. This is especially relevant in the age of AI and ML, when new innovations and technologies, such as ChatGPT, are coming online at a rapid pace,” Goklani concluded.