Imply’s Field CTO talks databases


In this age of information, analytics is a big deal. It's essential for every enterprise to make use of it to identify trends and make the necessary adjustments.

Organisations and developers looking to create their own analytics applications need powerful tools that enable real-time workflows. Many turn to Apache Druid, an open-source database that can quickly ingest large amounts of data. Rather than relying on community forums, other businesses choose to work with Imply, a software company founded by the creators of Apache Druid.

To learn more about Imply – specifically its analytics platform – Frontier Enterprise interviewed Eric Tschetter, the company’s Field CTO. It was Tschetter who wrote the first lines of code for Druid, and he has continued to work with the open-source database over the years, including during his previous roles at Splunk and Yahoo Inc.

Imply’s database is built from Apache Druid. What’s the difference between Apache Druid and Imply? How is Imply different from other types of column-oriented databases?

Imply delivers the complete developer experience for Apache Druid. Founded by Apache Druid’s original creators, Imply seeks to add to the speed and scale of the database with committer-driven expertise, easy operations, and flexible cloud deployment to meet developers’ application requirements.

Apache Druid is a real-time analytics database that helps developers build interactive data experiences at terabyte-to-petabyte-plus scale, for any number of users, across both streaming and batch data. Traditional columnar databases like Vertica, Snowflake, and BigQuery are architected for business intelligence and reporting, characterised by infrequent queries on batch data, whereas Apache Druid is built for modern analytics applications with hundreds to thousands of queries per second and is optimised for streaming data.

The Imply database is aimed at developers working with both streaming and batch data. From a complexity and scale standpoint, what challenges do developers regularly deal with to make this happen, and how does Imply address them?

There are three key technical challenges that developers face when analysing streaming data: processing millions of events per second, minimising the latency between ingestion and analysis, and ensuring consistent data quality. At its core, Druid was designed for streaming data, and it uniquely addresses these challenges.

For scale, Druid has a powerful and flexible architecture that scales to tens of millions of events per second. Confluent, for example, has an ingestion pipeline with more than 3 million events per second and leverages Druid for both internal observability and external analytics.

For latency, Druid supports query-on-arrival and minimises the time between an event occurring and when you can query that data, which supports even the strictest SLAs (service-level agreements) for data freshness.
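To make the query-on-arrival idea concrete, here is a minimal sketch of querying the last minute of freshly ingested data over Druid's SQL HTTP API. The router URL, the clickstream datasource, and its columns are hypothetical placeholders for illustration, not anything specific to Imply.

```python
# A minimal sketch: query events that arrived in the last minute via Druid's SQL API.
# The router URL, the "clickstream" datasource, and its columns are illustrative only.
import requests

DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"  # Router/Broker SQL endpoint

query = """
SELECT country, COUNT(*) AS events
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' MINUTE
GROUP BY country
ORDER BY events DESC
"""

response = requests.post(DRUID_SQL_URL, json={"query": query})
response.raise_for_status()

# Druid returns one JSON object per result row by default.
for row in response.json():
    print(row["country"], row["events"])
```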

Lastly, for data quality, Druid overcomes the challenge of dropped or duplicated streaming events at scale. Druid supports exactly-once semantics and continuous backup, so that in the event of a node failure, Druid can restart stream ingestion exactly where it left off without any data loss, ever.
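As an illustration of what setting up streaming ingestion looks like in practice, the sketch below submits a minimal Kafka supervisor spec to a Druid cluster; once the supervisor is running, Druid tracks the Kafka offsets itself to provide the exactly-once behaviour described above. The URL, topic, datasource, and column names are assumptions for the example, and the spec fields should be checked against the ingestion documentation for the Druid version in use.

```python
# A minimal sketch: start streaming ingestion from Kafka by submitting a supervisor
# spec to the Druid Overlord API (assumed here to be reachable via the router).
# Host/port, topic, datasource, and column names are illustrative only.
import requests

SUPERVISOR_URL = "http://localhost:8888/druid/indexer/v1/supervisor"

supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "clickstream",
            "timestampSpec": {"column": "ts", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["country", "user_id", "action"]},
            "granularitySpec": {"segmentGranularity": "HOUR", "rollup": False},
        },
        "ioConfig": {
            "topic": "clickstream-events",
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "useEarliestOffset": True,
        },
        "tuningConfig": {"type": "kafka"},
    },
}

resp = requests.post(SUPERVISOR_URL, json=supervisor_spec)
resp.raise_for_status()
print("Supervisor started:", resp.json())
```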

You co-authored Apache Druid in 2011. What are the most significant changes you’ve seen in databases since then?

Eric Tschetter, Field Chief Technology Officer at Imply. Image courtesy of Imply.

The largest change I’ve seen since then is the expansion of data usage in-product.

Back in 2010 and 2011, people were still primarily focused on how to store and manage the large amounts of data coming from their web properties and machines. They had data warehouses built for the physical world that they were trying to map onto the digital world. This gave birth to the NoSQL movement which, it turns out, was poorly named. That movement was really a challenge to all of the assumptions built into databases up to that point, with the goal of figuring out what needed to change in order to allow digital-age data to be used in an outward-facing fashion, i.e. directly in product.

The companies that were doing this at the time were major technical powerhouses: Google, Yahoo!, LinkedIn, etc. Since then, “NoSQL” has effectively come to an end; everyone has realised that the SQL query language is ubiquitous enough that you might as well figure out how to work with SQL. However, the ideas have stayed: how to make systems distributed, how to scale out horizontally, and what trade-offs can be made to expose a meaningful product experience on top of significant amounts of data.

This is also the space in which we’ve seen Druid solidify itself. We built Druid initially to put vast amounts of data directly into a product experience, and those roots have proven themselves out: that is how we see Druid used both in the community and in Imply’s customer base.

Before joining Imply as its Field CTO, you were a Fellow at Splunk. What specific lessons learned there are you able to apply at Imply? What was the most interesting part of working at Splunk?

Splunk has an amazingly successful story about how they were able to take a technology that was ahead of its time, identify use cases that both enabled a community of users around its proprietary product and solved real problems in security, and convert that into a repeatable business model.

Before joining Splunk, I had been very focused on R&D, i.e. building things, without as much of an understanding of connecting those things with a larger value proposition. Getting to experience part of Splunk’s story was extremely eye-opening in how technology can be leveraged to do more and greater things.

This taught me very clearly that the best relationship with a customer is one where you are aligned on delivering the same value proposition. If that alignment doesn’t exist, nobody is happy. In the end, this translates into my role at Imply, where I make sure that we focus on the final customer value proposition and that we validate our fit with customers so that we have a positive long-term relationship.

What predictions do you have for the database and analytics sectors over the next five years? How will emerging technologies like 5G, AI, and ML affect their evolution?

Starting with a 10-year time horizon, I believe that we will see convergence in the analytics sectors around three main pillars:

  1. Streams
  2. Data warehouses
  3. Data applications

That convergence might start five years from now, two years from now, or tomorrow; that, I don’t know. But in 10 years, the data powerhouses will combine all of these capabilities into a uniform package.

I don’t expect emerging technologies like 5G, AI, or ML to disrupt the data infrastructure market. Specifically, 5G just means more data, while AI and ML are things you do on top of data infrastructure, once you already have the data available.

If anything, I think that 5G disrupts the connectivity space, and AI/ML will disrupt the actual application space, where the needs of these technologies will push people more and more towards data infrastructure that can handle the increasing data volumes and power the AI/ML necessary for their products.

What are some of the most exciting tech developments that Imply is working on? What emerging technologies do you plan to adopt?

We’ve been heads down on not just one but two major product initiatives. We recently announced Project Shapeshift, a one-year initiative to redefine Apache Druid and the experience for developers. While thousands of companies are already using Druid, our goal is to greatly expand what Druid can do and make it easier for more developers to harness its power without having to be experts.

We made two recent announcements:

  • First, the introduction of a multi-stage query engine that extends Druid’s architecture beyond interactivity into reporting and alerts on a single database (see the sketch after this list);
  • Second, the availability of Imply Polaris, the cloud database service built from Apache Druid.
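For context, the multi-stage query engine is driven through SQL. Below is a hedged sketch of what submitting a long-running, report-style query as an asynchronous task might look like; the endpoint path, datasource, and columns are assumptions for illustration and should be verified against the Druid and Imply documentation.

```python
# A minimal sketch: submit a report-style SQL statement as an asynchronous task
# rather than as an interactive query. The endpoint path, the "clickstream"
# datasource, and its columns are assumptions for illustration only.
import requests

SQL_TASK_URL = "http://localhost:8888/druid/v2/sql/task"

report_query = """
SELECT country, COUNT(*) AS events, COUNT(DISTINCT user_id) AS users
FROM clickstream
WHERE __time >= TIMESTAMP '2022-01-01' AND __time < TIMESTAMP '2023-01-01'
GROUP BY country
"""

resp = requests.post(SQL_TASK_URL, json={"query": report_query})
resp.raise_for_status()
print("Task submitted:", resp.json().get("taskId"))
```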

Together, these developments show how Imply is delivering the most developer-friendly and most capable database for analytics applications.