Revealing the true cost of data: Confluent CTO’s insights

Image created by DALL·E 3.

With large amounts of data come greater costs, particularly with enterprises experimenting with AI, which demands significant compute power and storage.

During Confluent’s Kafka Summit in Bangalore, CEO Jay Kreps and CPO Shaun Clowes discussed why organisations are overwhelmed with their current data infrastructure. Meanwhile, CTO Chad Verbowski sat down with Frontier Enterprise to explain why data costs are spiralling out of control, and how new advancements in analytics can address this issue.

Before Confluent, you worked at eBay, then Microsoft, then Google. How has big data evolved since that time?

I think my career actually started in networking. I was always interested in scale problems, and human factors often created issues when it came to scaling. In networking, it was about the networking devices themselves. So, how do you eliminate that?

Data was also a scale problem. Back then, we addressed this by buying increasingly larger hardware. There was something called an HP Superdome that looked like a big table. This approach stemmed from the software and the way people thought about it — just building bigger and bigger units. By transitioning to a parallel system, and later to a cloud environment with effectively infinite parallel resources, we could approach problems very differently. That shift was one of the biggest changes I saw, and a significant turning point in how things have evolved.

Some of the problems that you were dealing with at eBay, like misclassification of products and bad customer experience, still persist today. How do you solve those from a modern data streaming perspective?

Chad Verbowski, Chief Technology Officer, Confluent. Image courtesy of Confluent.

The challenge is understanding the quality of data, which you typically assess offline. Whether you have a bunch of listings on eBay or a bunch of search results in Bing, it doesn’t matter much. You generally get a signal about the quality of the data as soon as people interact with it.

What we found is that the sooner you can get information about these interactions, the sooner you can improve things. Usually, you can improve them within a user session. You can make corrections, update results, and so on. The more real time you can make it, the better the experience.

The trouble is, if you try to build that experience using legacy systems, they can only move so fast. The cost of doing things quickly usually gets very high, and then there’s the issue of increased complexity the faster you go. Streaming changes that approach. It assumes that everything is real time and asks, “Now, how would you build those systems?” I think the AI aspects of the feedback loop are going to drive this change. That’s why we’ve done a lot of work to integrate AI with Flink—Flink with Kafka, et cetera—because it’s not enough to have just one element be real time and streaming. The whole system has to be streaming.
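The session-level feedback loop described above can be sketched in miniature. This is a hypothetical illustration, not Confluent's or eBay's actual code: the event names, scoring weights, and `SessionRanker` class are all assumptions, standing in for what a streaming job might do with interaction events as they arrive.

```python
# Hypothetical sketch: re-rank results within a single user session as
# interaction events (clicks, skips) stream in, rather than waiting for
# an offline batch job. Scoring weights are illustrative only.

class SessionRanker:
    def __init__(self, results):
        # Each result starts with a neutral relevance score.
        self.scores = {r: 0.0 for r in results}

    def on_event(self, event):
        # A click is a positive quality signal; a skip is negative.
        if event["type"] == "click":
            self.scores[event["result"]] += 1.0
        elif event["type"] == "skip":
            self.scores[event["result"]] -= 0.5

    def ranked(self):
        # Re-rank immediately, so corrections land in the same session.
        return sorted(self.scores, key=self.scores.get, reverse=True)

ranker = SessionRanker(["listing_a", "listing_b", "listing_c"])
for ev in [{"type": "skip", "result": "listing_a"},
           {"type": "click", "result": "listing_b"}]:
    ranker.on_event(ev)

print(ranker.ranked())  # ['listing_b', 'listing_c', 'listing_a']
```

The point of the sketch is the latency, not the scoring: because the ranker updates on every event, the correction arrives while the user is still in the session.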

During the presentation, it was mentioned that the performance of Confluent Cloud is 16x that of Apache Kafka. Where does that massive performance differential come from?

When you talk about performance, the first element at play is the networking that connects the client producing data to the service receiving it. If you’re running it on-premises, you’re somewhat limited by the available networking and the hardware in that rack. In the cloud, there have been significant advancements in optimisations. Once you collect data at the network layer, the next step is processing it to ensure it’s stored exactly once, with all the transactional pieces in place. When you run this on-premises, you’re limited to the capabilities of that machine.

The third part is writing data to the physical disks in that machine through a RAID (redundant array of independent disks) array or something similar.

In the cloud, we can break these steps apart. We actually don’t run the same code as we do on-premises. In the cloud, for example, you don’t have a single hard drive or one machine. We use a process called disaggregation, where we split that single system into multiple layers and optimise those layers independently. A lot of our performance gains come from this approach. These interactions are handled very differently in the cloud. That’s why we have an engine called Kora. We write things differently because we’re in the cloud, allowing us to take advantage of many optimisations.
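The disaggregation idea can be made concrete with a toy model. This is emphatically not Kora's actual architecture or code, just a sketch of the principle: the three steps Verbowski names (network receipt, exactly-once processing, durable storage) become separate layers that can be scaled and optimised independently instead of being bound to one machine.

```python
# Conceptual sketch of disaggregation: each concern is its own layer.

class NetworkLayer:
    """Receives batches from producers."""
    def receive(self, batch):
        return list(batch)

class ProcessingLayer:
    """Ensures each record is stored exactly once (here: dedup by id)."""
    def __init__(self):
        self.seen = set()
    def dedupe(self, records):
        fresh = [r for r in records if r["id"] not in self.seen]
        self.seen.update(r["id"] for r in fresh)
        return fresh

class StorageLayer:
    """Persists records; in the cloud this might be an object store
    rather than a local RAID array."""
    def __init__(self):
        self.log = []
    def append(self, records):
        self.log.extend(records)

net, proc, store = NetworkLayer(), ProcessingLayer(), StorageLayer()
# Second batch replays id 2, as a producer retry would.
for batch in [[{"id": 1}, {"id": 2}], [{"id": 2}, {"id": 3}]]:
    store.append(proc.dedupe(net.receive(batch)))

print(len(store.log))  # 3 unique records despite the duplicate
```

In a real cloud service each of these layers would be a separate fleet with its own scaling and failure model, which is where the independent optimisation pays off.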

How do you foresee the entry of AI into this whole equation? Will there be a complete fusion of the operational estate and analytical estate in the distant future?

I think it is already here. Too much time has been spent managing data, and it’s very primitive, almost like caveman days, in terms of collecting and working with data. This is due to the model of pushing data up and pulling it through various processes, creating separate piles at each step.

The efficient way is to just deal with the changes as they happen in those deltas. With AI’s evolution, we’re finding that lower latency in the data used for predictions dramatically improves the quality of the outcome, more so than using a more complex model. Building more detailed models is very expensive.

Without streaming, you would typically produce all the pieces for your model offline — you collect data, tune it, try things out, and eventually create a system that works. The problem is that this system exists in the analytical estate. You then have to pick it up, push it over, and plug it into your operational estate to use it. This process of picking it up and moving it over is very complex because the systems are different: the way models are built, and the way you interact with them, makes the transfer very hard.

Improving this process means not having to transfer data repeatedly. With streaming, data comes in, and you can still read legacy data as if it were batch data, work with the model, and serve requests within the same system. These processes can take advantage of real-time data, enhancing the quality of your results.

AI is evolving towards streaming-first usage. The main limitations are the cost, scale, and performance of stitching these other systems together—that process of picking it up and moving it over. For example, integrating with a vector database or processing attributes can be expensive. Some queries might cost US$1 each, which adds up quickly. Just using these models isn’t sufficient if it takes minutes or hours to get data to improve your models. If you’re limited in the diversity of new data or the number of dimensions you can consider in a request, your quality will be constrained. This is why we see a convergence around these elements.

How does the Kafka-Flink-Iceberg ecosystem intend to disrupt the way analytics is done?

Even if it’s not those three specific technologies, you’re going to have something that collects your data. It’s probably Kafka, but it could be something else. Something collects the data and writes it into a data lake, which is one of those transitions between systems. You’re going to process it, read it back out of that system, and do something with it, or put it inside your favourite data warehouse. Imagine if those three were actually one system: if you join them together, then as data comes in, you can do things like keep it in memory, so while you’re working with it, you don’t have to read it back from disk. You don’t have to ensure it’s copied and durable somewhere else. And on the analytics side, tools can now read that single copy.
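A minimal sketch of that "one system" idea, using none of the actual Kafka, Flink, or Iceberg APIs: a single in-memory log plays all three roles at once — ingestion target, stream-processing source, and analytical table — so nothing is ever copied between systems. The `UnifiedLog` class and its methods are invented for illustration.

```python
# Illustrative only: one log, three roles, zero copies shuttled around.

class UnifiedLog:
    def __init__(self):
        self.records = []        # the single authoritative copy
        self.subscribers = []    # stream processors notified on arrival

    def produce(self, record):
        # Ingestion role (the "Kafka" part).
        self.records.append(record)
        # Stream-processing role (the "Flink" part): act on data in flight.
        for fn in self.subscribers:
            fn(record)

    def scan(self):
        # Analytical role (the "Iceberg" part): read the same copy as a table.
        return list(self.records)

totals = {}
def running_total(record):
    totals[record["user"]] = totals.get(record["user"], 0) + record["amount"]

log = UnifiedLog()
log.subscribers.append(running_total)
log.produce({"user": "a", "amount": 5})
log.produce({"user": "a", "amount": 7})

print(totals["a"], len(log.scan()))  # 12 2
```

The streaming aggregate and the batch-style scan both see the same records, which is the sync-and-authority problem from the next paragraph dissolved by construction.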

Verbowski engages with his team on solutions for data streaming and analytics. Image courtesy of Confluent.

The biggest problem is networking and storage costs. Are you going to keep multiple copies of this — one in Kafka, one in your data lake, and a third in your data warehouse? How do you keep them all in sync? Which is the authoritative copy? Again, human management and operations come into play, and when those come in, it’s costly, limiting who has access to it. Therefore, a lot of this is about using the systems you have today and making them much simpler. Because they’re cheaper, you can afford to have everybody work with them. That’s why we invest in things like the data portal and make it very discoverable.

What’s the most exciting thing Confluent is working on right now?

The most exciting piece is the integration of processing with data collection. Kafka isn’t just a tool for bringing in data and interacting with it as a durable store of state for coordination. The ability to work with data along the way is incredibly powerful. Imagine you have some data coming through, and you can automatically attach some AI to it.

For example, there’s a check that says, “Is this column likely to be correct?” Let’s say it’s a postal code or an address. Imagine knowing that upfront, and anyone in your company or even globally could run that check. With the click of a button, you could have that appear as an extra column in your messages as they’re generated. Downstream, you could see that it’s, say, 50% likely correct and prompt the user while they’re still entering the data, asking, “Are you sure?”
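The check-as-a-column idea can be sketched very simply. This is a hypothetical stand-in for what such a check might do, not a Confluent feature: the field name `postal_code_likely_correct`, the US-ZIP regex, and the boolean (rather than probabilistic) verdict are all assumptions for illustration.

```python
import re

# Hypothetical sketch: attach a validity check to a column as messages
# flow through, emitting an extra field downstream consumers can act on.

POSTAL_RE = re.compile(r"^\d{5}(-\d{4})?$")  # simple US ZIP pattern

def enrich(message):
    # Append a correctness flag alongside the original message fields.
    code = message.get("postal_code", "")
    message["postal_code_likely_correct"] = bool(POSTAL_RE.match(code))
    return message

stream = [{"postal_code": "94105"},
          {"postal_code": "941o5"}]   # typo: letter 'o' instead of zero
checked = [enrich(m) for m in stream]

# The second message would trigger an "Are you sure?" prompt upstream.
print([m["postal_code_likely_correct"] for m in checked])  # [True, False]
```

A production version would presumably emit a confidence score from a model rather than a regex verdict, but the shape is the same: the check rides alongside the message as it’s generated, so the prompt can reach the user while they’re still typing.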

With AI, you could suggest corrections or provide smart suggestions based on additional information like the user’s IP address. This ability to ensure data correctness as it enters the system makes everything downstream more reliable. The data warehouse becomes more powerful, and the data is more trustworthy. That’s the trend I am very excited about.