How Coles and TfNSW illuminated blind spots through Splunk


As enterprises increasingly unlock the power of big data, artificial intelligence, and machine learning, blind spots have become unacceptable for organisations hoping to survive in a cut-throat digital world.

Take, for example, Australian supermarket and retail chain Coles, which saw about 5,000 of its support centre staff suddenly needing to work from home at the onset of the COVID-19 pandemic. This created not only access and security issues for employees, but also monitoring and evaluation challenges for its management team.

“We had to scale up our VPN to accommodate those 5,000 users. And we had a requirement to monitor how many users are working from home over VPN, (as well as) the performance of the VPN,” said Dinendra Wickramasinghe, Senior Security Engineer, Coles.

“Initially, our engineers had to get these statistics from VPN devices manually every four hours, put them into a table, and email them to management. Then, we evolved (it) from a manual exercise into Excel form. We started entering this data into Excel, then (utilising) some better visualisation for management,” he added.

Transport for New South Wales (TfNSW) experienced a similar predicament, and for a mission-critical service provider, the remote working arrangement challenged its observability capabilities.

“When employees are working from home, the security concerns have become more (serious), and this has actually given us a lot of insights into security monitoring with the various routines that we use across the transport cluster,” said Ravi Teja, Senior Architect at TfNSW.

Staying on top of things

With the volume of data coming in each day, both organisations knew they needed better visibility across their entire IT ecosystems, not only for corrective measures but also for predictive analysis.

“One thing we’ve never been short on is the volume of rich raw data spewing forth from our security and non-security infrastructure. But the masses of log data was like wading through molasses. Mining and cross correlation from multiple data sources was frustrating. It could take hours or even days to massage the normalising of the logs of data to achieve an outcome,” said Steven Russell, Manager, Security Technologies, Coles.

This is where the supermarket chain found Splunk’s observability platform useful.

“The way we wanted to use it is (as) a reactive forensic analysis engine. A means by which you pick over the entrails of evidence to build a case towards root cause on an implied event, or even a post impact,” noted Russell.

The company then began feeding SNMP (Simple Network Management Protocol) data into Splunk and visualising it through dashboards.

“At any given time, we know exactly how many users are logged in to the VPN devices, and we can also actually perform trend analysis. It shows how many multifactor-authenticated users have been made, RSA users logged in, and how many SSL VPN users are logged in and so forth, so we get the full granularity in here,” explained Wickramasinghe.
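For readers curious what that kind of trend query looks like in practice, here is a minimal sketch using Splunk's Python SDK (splunklib). The host, credentials, index, sourcetype, and field names are illustrative assumptions; Coles' actual schema was not disclosed.

```python
# A minimal sketch of the kind of VPN trend query described above, using
# Splunk's Python SDK (splunklib). The index, sourcetype, and field names
# are hypothetical -- the article does not describe Coles' schema.
import splunklib.client as client
import splunklib.results as results

service = client.connect(
    host="splunk.example.com",   # hypothetical search head
    port=8089,
    username="admin",
    password="changeme",
)

# Count concurrent VPN sessions per hour, split by authentication type
# (e.g. RSA vs SSL VPN), which is the granularity Wickramasinghe describes.
query = (
    "search index=network sourcetype=snmp:vpn "
    "| timechart span=1h dc(session_id) AS active_sessions BY auth_type"
)

# Oneshot search: blocks until the search finishes, then streams results.
reader = results.ResultsReader(service.jobs.oneshot(query, output_mode="xml"))
for row in reader:
    if isinstance(row, dict):
        print(row)
```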

Before Splunk, TfNSW was having a hard time monitoring bushfires. Even with the massive resources involved in data gathering and response operations, the agency was still experiencing blind spots.

“We actually worked on a DR (disaster recovery) mechanism to move all our assets back onto a DR mode. We are now trying to enhance that a bit more by having other adverse conditions back onto Splunk, like flooding, power shortages, and any kind of inclement weather that happens. So we are trying to push all the data back onto Splunk, and predict and prevent any such things happening by using this monitoring,” said Teja.
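Pushing an external feed such as weather or flood warnings into Splunk is commonly done through the HTTP Event Collector (HEC). The sketch below shows the general shape of such an integration; the endpoint format and Authorization header are standard HEC, while the token, index, sourcetype, and event fields are hypothetical placeholders rather than TfNSW's actual configuration.

```python
# A minimal sketch of pushing an external adverse-weather feed into Splunk
# via the HTTP Event Collector (HEC). Token, index, sourcetype, and event
# fields are placeholders for illustration only.
import requests

HEC_URL = "https://splunk.example.com:8088/services/collector/event"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"  # placeholder token

event = {
    "index": "weather",             # hypothetical index
    "sourcetype": "bom:warning",    # hypothetical sourcetype
    "event": {
        "type": "flood_watch",
        "region": "Sydney Metro",
        "severity": "moderate",
    },
}

resp = requests.post(
    HEC_URL,
    headers={"Authorization": f"Splunk {HEC_TOKEN}"},
    json=event,
    timeout=10,
)
resp.raise_for_status()
```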

With the real-time problem visibility and end-to-end situational awareness Splunk enabled for the agency, TfNSW executives can easily look at the whole picture and quickly notice if something is off.

“We started our journey from reactive mode to proactive mode. Now, we are in good shape where we have started looking at predictive analytics, as well as a preventive analytics with a lot of machine learning algorithms that Splunk has worked with, (as well as) the correlation of multiple systems, especially the database layer, the web layer, the infrastructure layer, or the hosting layer and the network layer, including some of the security events and the correlations across this,” Teja added.

All hands on deck

During the Sydney New Year's Eve celebration in 2021, TfNSW was put to the test by the heavy influx of tourists arriving to watch the city's fireworks display.

The job was to keep IT and OT (operational technology) services in tip-top shape, assuring the riding public that transport services would always be up and running.

“What we have done is used the monitoring services to keep up the health scores of every single asset, which are ensuring that the stations are healthy. At the same time, trains are on time and helping the public to move in and out. What we have done to achieve that is end-to-end service monitoring,” noted Teja.

“The applications that are really important for the train stations to be active and the trains to move, those are monitored. If there is any drop in the health score, or if there is any incident that has been identified by any other tool, and that comes on to the Splunk dashboard, a red light immediately pops up,” he added.
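The article does not detail how those health scores are computed (in Splunk ITSI they are typically derived from weighted KPIs), but the score-to-light logic Teja describes can be sketched in a few lines. The thresholds, assets, and scoring formula below are all illustrative assumptions.

```python
# A toy sketch of the "health score -> red light" logic described above.
# Thresholds, asset names, and the score computation are assumptions.
from dataclasses import dataclass

@dataclass
class Asset:
    name: str
    kpis: dict[str, float]  # KPI name -> score in [0, 100]

def health_score(asset: Asset) -> float:
    """Naive health score: the mean of all KPI scores."""
    return sum(asset.kpis.values()) / len(asset.kpis)

def status_light(score: float) -> str:
    """Map a health score to a dashboard light."""
    if score >= 80:
        return "green"
    if score >= 50:
        return "amber"
    return "red"

station = Asset("Central Station", {
    "ticketing_app": 95.0,
    "passenger_info_displays": 40.0,  # degraded KPI drags the score down
    "network_links": 88.0,
})

score = health_score(station)
print(station.name, round(score, 1), status_light(score))
# -> Central Station 74.3 amber
```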

All of this, Teja said, is made possible by automated alerting. However, not every event necessarily triggers an incident.

“Whenever an incident or an event alerting starts, we have a noise reduction, because we have a lot of monitoring tools that start triggering the alerts whenever the issue is identified. What happens when an alert pops up there? We need some noise reduction, because we don’t want every alert to be an incident. We have got various rules, and the rules engine runs on call. And this on-call engine will ensure that the noise is reduced before even an event or an alert is triggered as an incident,” he explained.
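A heavily simplified sketch of that kind of noise reduction follows, assuming two common rules: deduplicate repeats within a time window, and only promote an alert to an incident after repeated firings. The window and threshold values are illustrative, not TfNSW's actual rules.

```python
# A simplified sketch of alert noise reduction along the lines described:
# drop duplicates inside a window, promote to incident after N firings.
# Window and threshold values are illustrative assumptions.
import time
from collections import defaultdict

DEDUP_WINDOW_S = 300      # ignore identical alerts within 5 minutes
PROMOTE_THRESHOLD = 3     # need 3 distinct firings to open an incident

last_seen: dict[str, float] = {}
counts: dict[str, int] = defaultdict(int)

def handle_alert(alert_key: str, now: float | None = None) -> str:
    """Return 'suppressed', 'noted', or 'incident' for an incoming alert."""
    now = time.time() if now is None else now

    # Rule 1: drop duplicates arriving inside the dedup window.
    if alert_key in last_seen and now - last_seen[alert_key] < DEDUP_WINDOW_S:
        return "suppressed"
    last_seen[alert_key] = now

    # Rule 2: only promote to an incident after repeated firings.
    counts[alert_key] += 1
    if counts[alert_key] >= PROMOTE_THRESHOLD:
        counts[alert_key] = 0  # reset once an incident is raised
        return "incident"
    return "noted"

# Three firings five minutes apart escalate to an incident.
print([handle_alert("db_latency_high", i * 301.0) for i in range(3)])
# -> ['noted', 'noted', 'incident']
```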

Observability is key

As enterprises strive to keep up with new technology, new sets of challenges emerge on a regular basis. Gary Steele, President and CEO of Splunk, identified three major enterprise challenges: security, digitalisation, and complexity.

“We’re living in an increasingly unpredictable world, (wherein) threat actors are well-funded. They’re patient, they’re persistent, and over the past few years, think of all the events we’ve dealt with, from ransomware to supply chain attacks, to zero-day vulnerabilities,” he said.

As for digitalisation, Steele said that despite the uncertainty brought about by the pandemic, customers are expecting more out of their digital experience, and businesses, in turn, are rapidly responding.

“Research firm IDC recently published a statistic, saying that they anticipate 750 million new applications (to get) deployed between 2023 and 2025. I think it means that we’re all going to be working for software companies. And interestingly, those systems that we’ve had, that have supported the customer experience, are really now up front, taking and handling the customer transaction itself,” he explained.

Because organisations are scaling up the way they do business, the number of assets that require monitoring also multiplies.

“Whether you’re dealing with a hybrid world, or you’re incorporating on-premises capabilities, whether you’re in a multi-cloud world, all that brings more complexity into your environment, you have more services to monitor, you have a much broader attack surface. And you have many more points of potential failure. And all of this is happening while you’re trying to do more,” said Steele.

“Now, just consider a system that isn’t secure, that can’t be relied upon for mission-critical workloads. An application that doesn’t load quickly will just be abandoned by users. And finally, an organisation that spends time firefighting clearly doesn’t have the bandwidth to truly innovate. You need this foundation of security and resilience, so that you can innovate with speed and agility,” he added.

Meanwhile, Russell shared some enterprise takeaways from Coles' eight years of using Splunk.

“When implementing Splunk, invest in the preparation of the architecture. All the components, such as indexers, forwarders, and search heads need to be respectfully dimensioned. Very importantly, make sure you bake into your operational expense the right measure of ongoing effort to maintain the environment. Splunk, like all technologies, benefits from regular servicing and tuning. The point is, make sure you assign people to take care of it, and bake it into their KPIs. It will pay back in the long run,” he recommended.
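As a back-of-envelope companion to that dimensioning advice, the sketch below estimates indexer count and retention storage from daily ingest volume. The per-indexer capacity, headroom, and compression figures are assumptions for illustration; Splunk's reference-hardware guidance varies with workload, so substitute numbers from your own capacity planning.

```python
# Back-of-envelope sizing for the "dimension your components" advice.
# All constants here are illustrative assumptions, not official figures.
import math

def indexers_needed(daily_ingest_gb: float,
                    gb_per_indexer_per_day: float = 100.0,
                    headroom: float = 1.3) -> int:
    """Estimate indexer count from daily ingest plus growth headroom."""
    return math.ceil(daily_ingest_gb * headroom / gb_per_indexer_per_day)

def retention_storage_tb(daily_ingest_gb: float,
                         retention_days: int,
                         compression_ratio: float = 0.5) -> float:
    """Rough storage estimate: ingest x retention x compression."""
    return daily_ingest_gb * retention_days * compression_ratio / 1024

print(indexers_needed(500))                     # 500 GB/day -> 7 indexers
print(round(retention_storage_tb(500, 90), 1))  # ~22.0 TB at 90-day retention
```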