Why are so many AI projects stuck in the pilot phase? While AI holds enormous potential, many enterprises are quick to adopt it — often without a clear data strategy.
Shaun Clowes, Chief Product Officer of Confluent, has seen where things go wrong. In the first of a two-part feature, he shared his observations with Frontier Enterprise during the Confluent Current 2025 conference in Bangalore.
First off, how do you see the role of data warehouses and data lakes evolving?
I don’t think they’re going away. Organisations have invested a lot of time and money to create a global view of their data — to support good business decisions and understand what’s happening. I don’t see a world where there’s no need for data warehouses or data lakes. They’re a critical part of the ecosystem.
The challenge is, if those are the only places where you can see all of your data, how do you actually build good things? Data lakes and warehouses are almost always powered by batch ETL or ELT. The data is often very old, because it’s joined across multiple batches — and it’s only as fresh as the oldest one.
They’re also not always high quality. The data might land in poor formats, or be transformed in ways that no longer reflect the original. So while data lakes and warehouses are powerful, they just don’t meet the needs of putting data to work, of solving real business problems with it.
The rise of generative AI and LLMs has really highlighted this gap. It’s become obvious that when you invoke an LLM, the moment you prompt it, the only context it has is what you give it. Whatever you put into that window determines whether the outcome is good or bad.
That makes it critical to feed in all the most important enterprise data — everything about the customer, the employee, the issue, the delivery — whatever the business problem is. You have to provide the best possible data instantly. And once you have the output, you need to act on it, pushing it back into the right systems across your ecosystem.
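To make that pattern concrete, here is a minimal Python sketch. The lookup functions, the call_llm wrapper, and the ticket update are hypothetical stand-ins for whatever operational systems and model provider an enterprise actually uses; the point is simply that the context is assembled at the moment of invocation and the output is pushed back out to downstream systems.

```python
import json

# Hypothetical stand-ins for live operational lookups and a model provider.
def fetch_customer(customer_id):
    return {"id": customer_id, "tier": "gold"}

def fetch_open_orders(customer_id):
    return [{"order": "A123", "status": "delayed"}]

def call_llm(prompt):
    # Placeholder for whichever LLM API is actually in use.
    return json.dumps({"action": "expedite", "order": "A123"})

def update_ticket(ticket_id, decision):
    # Placeholder for pushing the decision back into the system of record.
    print(f"ticket {ticket_id}: {decision}")

def handle_issue(customer_id, ticket_id):
    # 1. Assemble the freshest possible context at the moment of invocation.
    context = {
        "customer": fetch_customer(customer_id),
        "open_orders": fetch_open_orders(customer_id),
    }
    # 2. Whatever goes into this prompt is the only context the model has.
    prompt = ("Using only the context below, propose the next action as JSON.\n"
              + json.dumps(context))
    decision = json.loads(call_llm(prompt))
    # 3. Act on the output by routing it back into downstream systems.
    update_ticket(ticket_id, decision)
    return decision

if __name__ == "__main__":
    handle_issue("cust-42", "tkt-7")
```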
Data warehouses and lakes will remain important for analysis. They’re where humans engage with data — to understand it, make decisions, and forecast the business. But increasingly, some of that processing will shift left: higher data quality from the moment it’s created, and more processing earlier in the pipeline, so the data stays real time.
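As an aside on what shifting left can look like in practice, here is a rough sketch using the confluent-kafka Python client. The topic names and the validation rule are illustrative assumptions, not anything Clowes described; the idea is that records are checked and cleaned as they are produced, rather than repaired later in a batch job.

```python
import json
from confluent_kafka import Consumer, Producer

# Topic names, broker address, and the quality checks are illustrative assumptions.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "shift-left-enricher",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["orders.raw"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    order = json.loads(msg.value())

    # Enforce quality at creation time instead of fixing it downstream in a batch.
    if not order.get("customer_id"):
        producer.produce("orders.dead_letter", msg.value())
    else:
        order["amount"] = round(float(order["amount"]), 2)
        producer.produce("orders.clean", json.dumps(order).encode("utf-8"))
    producer.poll(0)  # serve delivery callbacks
```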
As data gets closer to the edge, will it become more important?
All kinds of data exist to be acted upon. Some of that data is reviewed by humans looking at reports and making decisions. Some is manually keyed in and responded to by a customer service agent on the phone. What computers enable is the ability to react in real time.
Traditionally, that meant writing applications with lots of code to take in events and take action. AI agents dramatically accelerate that. You can now make computers do very smart things, very quickly. These are things that previously would have taken months or years of engineering effort.
If we’re going to generate more of these systems, more software products that sense the world, react to it, and act back on it, we have to figure out how to do that at scale, and do it safely.
Take medical devices, for example. You’re reacting to medical signals. You want to ensure that whatever’s coming out the other side is reliable, because it could directly affect a person’s health.
Or consider an AI workflow handling package deliveries and logistics. You need to avoid accidentally routing packages to the wrong place, overstocking a warehouse, or loading up a delivery vehicle that is already out moving other shipments.
On one hand, it is easier than ever to put data to work. On the other, if we don’t do this right, it could become incredibly chaotic, with agents doing all sorts of unintended things. So we have to take this great power we have been given, feed it the best possible information, and apply it in the safest way.
Are you saying that with great risk comes great reward?
What is interesting about AI is that so many of these projects are stuck in pilot mode. Ask yourself, why are they stuck in pilot mode? I would argue the most obvious reason is that it is easy to build one of these systems, but hard to bring it the data it needs instantly and at scale.
For example, let me walk through a simple use case for an airline. Let’s say I want to rebook a passenger who’s missed their flight. In the lab, it’s easy to build an LLM that takes in the customer’s details, the current flight, the destination, and the other available flights. It processes that information and outputs a decision on what flight to move them to. That part is easy.
But in practice, it needs to work perfectly every time. That means it must know the passenger’s information at the very moment it makes a decision. It must know which flights and seats are available. It must know the passenger’s flight details and frequent flyer number, all at that exact instant. If any of that is missing, it can behave unpredictably. It might not work at all, or it might make the worst possible decision. It could try to book the passenger on a flight that does not exist, or fail to move their bags because it does not think they have any. The best and worst outcomes can come from exactly the same code.
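One way to guard against that, sketched in Python with hypothetical field names: before the model is invoked, the agent checks that every piece of context it depends on is actually present, and escalates to a human rather than letting the model improvise.

```python
import json

# Field names are illustrative; the point is to refuse to invoke the model
# when the context is incomplete rather than letting it guess.
REQUIRED_CONTEXT = ("passenger", "frequent_flyer_id", "missed_flight",
                    "checked_bags", "available_flights")

def rebook(context, call_llm):
    missing = [field for field in REQUIRED_CONTEXT if field not in context]
    if missing:
        # Fail safely: escalate instead of booking a flight that may not exist.
        return {"status": "needs_human", "missing": missing}
    prompt = ("Choose the best rebooking option strictly from available_flights "
              "and return it as JSON.\n" + json.dumps(context))
    return {"status": "ok", "decision": json.loads(call_llm(prompt))}
```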
When you write traditional code, it doesn’t work like that. The result is usually average or generally good, and above all it is consistent, whether reliably good or reliably bad.
Except in rare cases, AI is nothing like that. It’s far less deterministic. So people are starting to realise they need to ask: How do I ensure that at the instant I invoke this LLM, all the required data is complete? Because if I cannot do that, it will not work. And it’s not going to work at scale.
As a result, people have to monitor the output. Let us say I deploy an AI rebooking agent. I still need to watch how it performs. The model might change. The input data might change. Either could cause the system to start doing crazy things.
I have to monitor the output, and I need a way to introduce new agents when I discover better information. I need to be able to put in a new agent that competes with the existing one, check whether it performs better, and eventually retire the old one. So you end up with complicated deployment and data management challenges around these systems.
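A rough picture of that champion/challenger rollout, again in Python with made-up names and metrics: a small share of live traffic goes to the new agent, outcomes are recorded for both, and the challenger is only promoted once it has enough volume and a better success rate. The success criterion and the 10% split are assumptions for illustration.

```python
import random

# Outcome counters for the current agent ("champion") and the candidate ("challenger").
stats = {"champion": {"total": 0, "success": 0},
         "challenger": {"total": 0, "success": 0}}

def route_request(request, champion, challenger, stats, challenger_share=0.1):
    # Send a small, configurable share of live traffic to the challenger agent.
    use_challenger = random.random() < challenger_share
    agent = challenger if use_challenger else champion
    result = agent(request)
    # Record outcomes so the two agents can be compared on the same traffic mix.
    key = "challenger" if use_challenger else "champion"
    stats[key]["total"] += 1
    stats[key]["success"] += 1 if result.get("rebooked") else 0
    return result

def should_promote(stats, min_samples=500):
    # Promote only after enough volume and a better success rate than the champion.
    champ, chall = stats["champion"], stats["challenger"]
    if chall["total"] < min_samples:
        return False
    return (chall["success"] / chall["total"]) > (champ["success"] / max(champ["total"], 1))
```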
It strikes me that all AI problems are, in the end, data problems; they’re data management problems. Organisations have spent a long time avoiding those. They assume that if they can make their batches slightly faster, or run more of them into the data warehouse and just work harder, they can sidestep rethinking their approach to data movement and management.
But I think it is inevitable. More data needs to move in real time. More of it has to stream. More of it needs to operate in the live operational domain. Because AI systems are ultimately apps. They’re real-time apps. Whether you like it or not, they’re not reports, they are not dashboards — they’re apps. So you need to figure out how to feed them, manage them, understand them, develop them, and evolve them.