Military-grade, real-time data science

Nima Negahban, Co-Founder and Chief Technical Officer of Kinetica, talks about active analytics and the growing importance of CIOs as data science becomes the heart of every enterprise’s growth strategy.

Image courtesy of Kinetica.

Tapping advanced analytics in real time is the holy grail for every enterprise. In 2010, this was the aim of the US military, which asked Kinetica co-founders Amit Vij and Nima Negahban to build a “future-forward” purpose-built project from the ground up. The aim was to consume the more than 250 real-time data feeds that the military had access to, and to deliver a common query API to analysts and developers so that they could rapidly create capability and put it in the theater for use by soldiers. The project later became Kinetica, which has since infused that technology into enterprises across the globe and through the entire spectrum of verticals.

We sat down with Nima Negahban to discuss the big data challenge in enterprises, from hardware and querying to visualization and business use.

Kinetica got its start designing military-grade analytics and visualization for massive geospatial datasets. Tell us a little about that.

The US military ran into the same pitfalls that we see today with enterprises that are trying to do this.

One is expensive developer TCO because any alteration, added capability or operational orchestration essentially meant making a new database.

The second is massive hardware fan-out, because with indexing structures, one way they tried to get performance gains at query time was data replication. So the cluster size grew and grew.

And finally, most importantly, as the query inventory grew, you had to do more elaborate indexing, which created a massive latency gap. So the real-time nature was lost, and the index structures were falling farther and farther behind.

Where we came in was that we had a lot of experience with GPUs, and in that same year – 2010 – the Fermi chipset had come out, which was a watershed moment for the GPU as far as performance and capability went.

Instead of designing a database for compute as a scarce resource, we designed for compute as an abundant resource, and optimized for leveraging compute in a distributed fashion and being able to consume data and query data simultaneously. We abstracted that away from the developers so that they don’t have to augment their queries or their data models to get the scale and performance that they need – then they can focus on their applications and the actual analytics.

We were lucky to be a part of that project, because really, it was like an enterprise trying to do something that we see modern commercial enterprises do now. You see many enterprises now going through the same journey, and they’re realizing, hey, this is not as easy as we had thought.

One of the things that forced us to have a very rich location pedigree is that one of the focuses of that project was to relate almost everything to space and time. So there was a massive amount of location data. Our ability to use the GPU to do very high resolution filtering of that location data and then do OLAP on top of it is really what set us apart in that project.
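As an illustration of the pattern described here (filter records by space and time, then aggregate what survives), the following is a minimal single-machine Python sketch. It is not Kinetica's implementation and uses no GPU; the `Ping` record, the feed names and the coordinates are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Ping:
    """One hypothetical location record: who, where, when."""
    source: str
    lat: float
    lon: float
    ts: int  # seconds since some reference time


def filter_space_time(pings, lat_range, lon_range, ts_range):
    """Keep only pings inside a bounding box and a time window (inclusive)."""
    (lat_lo, lat_hi), (lon_lo, lon_hi), (ts_lo, ts_hi) = lat_range, lon_range, ts_range
    return [
        p for p in pings
        if lat_lo <= p.lat <= lat_hi
        and lon_lo <= p.lon <= lon_hi
        and ts_lo <= p.ts <= ts_hi
    ]


def count_by_source(pings):
    """OLAP-style roll-up: number of matching pings per source."""
    counts = defaultdict(int)
    for p in pings:
        counts[p.source] += 1
    return dict(counts)


pings = [
    Ping("feed-a", 38.90, -77.03, 100),
    Ping("feed-a", 38.91, -77.04, 160),
    Ping("feed-b", 38.89, -77.02, 130),
    Ping("feed-b", 40.71, -74.00, 140),  # outside the bounding box
    Ping("feed-a", 38.90, -77.03, 900),  # outside the time window
]

hits = filter_space_time(pings, (38.8, 39.0), (-77.1, -77.0), (100, 200))
summary = count_by_source(hits)
print(summary)
```

The filter and the aggregation are kept as separate steps so that the same filtered set can feed any number of different roll-ups.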

How do you see Kinetica compared to competing solutions?

Where we really differentiate is when you need to be consuming data continuously and doing complex queries simultaneously. The [SAP] HANAs of the world and the Teradatas of the world are great data warehouse products. Where they really start to struggle is if you start to continually ingest data and do those complex queries at the same time. We label that whole space as passive analytics, where essentially, I’m going to ingest my data overnight, run my report after I ingest, and then the report is made and I’m done, and everyone’s happy. Or I have a very fixed query inventory, and I’m going to make intermediate tables as I ingest data, and my analysts are happy.

There’s obviously a huge customer base for that. But more and more, we’re seeing enterprises realize, “Hey, it’s not good enough for this to be 24 hours old, I need to know now, it needs to be up to the minute, and I need to be able to give my analysts ad hoc ability.” Because as they’re churning through stuff, the analysts are going to have ideas, and then they’re going to be stuck, basically, because they don’t have the query capability. We’re doing that scan off the granular data, and rolling it up on the fly. That allows them to be more creative in the moment, and do a much more fruitful kind of feedback cycle where they can go in any multitude of directions without having to call data engineering and say, “Hey I need you to do this roll-up for me.”
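The difference between a fixed set of pre-built roll-up tables and rolling up granular data on the fly can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Kinetica's engine: `GranularStore`, its methods and the sales rows are hypothetical names invented for the example.

```python
from collections import defaultdict


class GranularStore:
    """Toy in-memory table: rows are kept at full granularity."""

    def __init__(self):
        self.rows = []

    def ingest(self, row):
        # New rows are queryable as soon as they are appended.
        self.rows.append(row)

    def rollup(self, group_by, measure):
        """Sum `measure` grouped by any tuple of columns, computed at query time."""
        totals = defaultdict(float)
        for row in self.rows:
            key = tuple(row[col] for col in group_by)
            totals[key] += row[measure]
        return dict(totals)


store = GranularStore()
store.ingest({"region": "east", "product": "a", "sales": 10.0})
store.ingest({"region": "east", "product": "b", "sales": 5.0})
store.ingest({"region": "west", "product": "a", "sales": 7.0})

by_region = store.rollup(("region",), "sales")

# A row that arrives later shows up in the very next query,
# and the analyst can group by a different column without a new table.
store.ingest({"region": "west", "product": "b", "sales": 3.0})
by_product = store.rollup(("product",), "sales")
by_region_after = store.rollup(("region",), "sales")
```

Because no aggregate is stored ahead of time, any grouping the analyst thinks of is available immediately; the cost is a full scan per query, which is the work the interview describes handing to abundant compute.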

What advice would you give a CIO or CTO just beginning their data journey?

The number one thing is to start backwards. Define what your business objective is – really understand what you want to do with the data. Are you trying to create a new set of product capabilities, or increase top line revenue, or increase efficiency? Understand that, and from there, work backwards to understand what you need to build in your infrastructure to achieve those goals. Often, that can be an active analytics platform, or it can be a more classic data warehouse, plus an ETL layer like Kafka to land and transform data. But ultimately, you have to understand what is the goal you’re trying to achieve from a business perspective. Because at the end of the day, the CIO or CTO is a service provider to the business. So the business goal has to be aligned, and then you can work it backwards.

Where do you see the intersection of massive data analytics with machine learning and AI?

Right now, there’s a huge investment happening in making model development easier for data scientists. One part that’s being ignored in the model development vector is getting access to data in an operational timescale.

Quite often, you see data scientists doing model development off of old data, and sometimes that’s okay, but sometimes they would benefit from being able to do large-scale feature generation and data extraction from the data as it is at that moment.

The second vector is around the operational capabilities that are being given to data scientists once they have built the best model. Often what I see is, they hand it over to their sysadmin team with some instructions: “Run this Python script and it’ll put out these scores every night,” or whatever it might be. But there’s no operational framework that treats ML and AI and hosting models as a heavyweight problem, meaning not just managing the process as a process, but also capturing the interactions that you are imprinting on, capturing the inferences, the feature generation, the data that you’re doing feature generation upon, and also making it incredibly easy for a data scientist to do deployment of their model in an operational way.

So with our 7.0 release, you can import a model as a container, then you can press a button, and we’ll wrap a RESTful service around it. They can give that RESTful service to their application development team, and as the developers use the model and ask for predictions or scoring or whatever it might be, the analysts can analyze and search and audit the activity of the model in real time. So it’s not just about managing this process – it’s actually thinking about what you need to capture in an AI context that is truly going to give operational insights to the data scientists.
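The idea of serving a model behind a request/response interface while capturing every inference for later audit can be sketched as follows. This is a minimal Python illustration, not the Kinetica 7.0 API: `AuditedModel`, `handle`, `search` and `toy_model` are hypothetical names, the audit log is a plain list rather than a queryable table, and JSON strings stand in for the HTTP layer.

```python
import json
import time


class AuditedModel:
    """Wraps any callable model and records every scoring request."""

    def __init__(self, name, predict_fn):
        self.name = name
        self.predict_fn = predict_fn
        self.audit_log = []  # assumption: a real system would use a queryable table

    def handle(self, request_body):
        """Take a JSON request body, return a JSON response body."""
        features = json.loads(request_body)["features"]
        score = self.predict_fn(features)
        # Capture the inputs and the inference, not just the fact that a call happened.
        self.audit_log.append({
            "model": self.name,
            "features": features,
            "score": score,
            "ts": time.time(),
        })
        return json.dumps({"model": self.name, "score": score})

    def search(self, min_score):
        """Audit query: every logged inference at or above a score threshold."""
        return [e for e in self.audit_log if e["score"] >= min_score]


def toy_model(features):
    # Hypothetical stand-in for a trained model: a fixed weighted sum.
    return 0.5 * features["x"] + 0.25 * features["y"]


service = AuditedModel("toy-v1", toy_model)
first = json.loads(service.handle('{"features": {"x": 2, "y": 4}}'))
second = json.loads(service.handle('{"features": {"x": 0, "y": 1}}'))
flagged = service.search(min_score=1.0)
```

Logging the features alongside each score is what lets an analyst later ask which inputs produced a given prediction, rather than only seeing that the scoring process ran.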

Do you think we’re getting to a point where data science is at the center of the enterprise’s operations?

Right now data science is still a little nascent – it’s not yet at the center of most enterprises driving second-by-second decisioning. I thought we’d be here by now, quite honestly. But it is going towards that.

As enterprises get more comfortable and more confident around how to leverage the modeling techniques that they’ve built with their data science teams, the question is naturally going to be how they operate these models on a day-to-day basis in a way that gives the data science team the real operational visibility that they need, so that if something does go wrong, they have the tools and capability they need to respond.

What’s keeping enterprises from getting there sooner?

One thing is skills – having your application developer, your data engineer and your data scientist being able to easily work with one another, and having the management and data-engineering folks understand how to operationally host these models, and how to operationally educate and enable the rest of the developer community within the enterprise on how to leverage them. That enablement is still ongoing in the enterprise, but it’s really more of a people and skills problem than anything else. It’s having IT management and the broader enterprise management understand AI better, what it can and can’t do.

There’s also learning that needs to happen in the systems engineering community – folks like myself, who are building platforms – to really understand what AI is, what it can do, what it can’t do, and how we can better enable the data scientists. For the folks that have done that to date, the focus has all been on model development. And as that tool chain matures, it’s going to go into the operational side, and that’s where we’re focused.

Do you think Google open-sourcing TensorFlow helps in that regard?

TensorFlow is the flagship AI framework out there, in my opinion. TensorFlow is unique in the sense that Google is thinking operationally as well.

The TensorFlow Serving project gives you a hint of that. They only go so far, but you can see that the folks that have been trying to build AI into the center of their core from the beginning – like the Googles and the Ubers – they understand that there’s an operational side to this that you have to make repeatable and sustainable.

How do you see enterprise IT evolving over the next five to ten years?

I think this is just the beginning. As every CEO realizes that they are going to become a data company, and have a digital transformation journey, the CIO and the CTO become ever more important, because now they are not just affecting daily operations – they’re affecting their growth strategy. Fundamentally, the CEO needs to be working with the CIO and the CTO to have a very in-sync, well-defined and aligned growth strategy around their digital transformation initiative. So as the top-line management get it more and more, and realize that this is where their growth is going to come from, it’s only going to intensify the budgets, the attention, the progress, the demand, and ultimately the capabilities that IT needs to deliver in each of these enterprises. It’s an exciting time.