Data has exploded in sheer size during the past decade. Thanks to the growth of storage capacities, cheaper storage devices, and faster computers, enterprises today are facing an exponential increase of information that needs to be processed.
All this data, however, is useless if enterprises can’t generate any insight from it. The relationships between data points are far more significant than the data itself.
Enter the graph database, which is built to store and navigate relationships. As long as the data exists and there is a path between the data points, any question can be answered, even unexpected ones.
California-based TigerGraph is one of the companies currently offering graph database services. One of the authors behind the TigerGraph platform is Dr Yu Xu, the organisation’s founder and CEO, who specialises in a sizable list of data-related fields such as parallel computing, database management, Hadoop, and MapReduce.
Frontier Enterprise recently sat down with Dr Xu to talk about the origins of TigerGraph, his experience at Teradata, why graph databases are popular right now, what he thinks about the competition, and more.
Could you take us through your work at Teradata in Hadoop and MapReduce, and how that impacted what you decided to do later? How did TigerGraph start?
My background is in databases. I got a PhD (in computer science and engineering) from the University of California San Diego. So, I love databases. I want to do databases.
I joined Teradata before Hadoop. Teradata is a really high-end, expensive data warehousing company, and it has the best technology money can buy. When Hadoop came along, for the first time people could get access to distributed file systems and MapReduce computation, so they could do at least some type of analytics on huge amounts of data. That was exciting to many people, but not to me, because the MapReduce model is too simplified. It’s not even as powerful as a relational database.
Back to the question: Teradata was not my first job. My first ever job was at a startup in San Diego, founded by my former PhD advisor. That was around 2000, and it was really exciting, but it didn’t go so well because of the dot-com bubble burst, so I returned to school to finish my PhD.
At Teradata, I got to work with a lot of big customers like eBay, Walmart, and Bank of America. I noticed that sometimes, some of the BI (business intelligence) reports took a few days or weeks to finish. I saw that it was kind of a graph problem. If we had a native graph database, the performance could be 1,000 times faster.
I was not excited about Hadoop. I know it’s really valuable to a lot of people, but on the technology side, in terms of challenge, I didn’t see it as the right direction to go. Google started the whole Hadoop movement because it published some papers around its own distributed file system. But Google also said the distributed MapReduce framework does not work for a graph type of computation.
For example, Google is famous for PageRank, right? PageRank is an algorithm used to rank web pages in its search engine results. It is computed iteratively until it can decide which websites are more important than others. For a PageRank type of computation, a traditional relational database or MapReduce does not work at all. Google said they had to do native graph computation in order to get good performance. This excited and inspired me.
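To make the iterative idea concrete, here is a minimal Python sketch of a PageRank-style computation. It is an illustration only, not Google’s or TigerGraph’s implementation; the function name `pagerank`, the damping factor of 0.85, the convergence tolerance, and the toy link graph are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, tol=1e-8, max_iter=200):
    """Iteratively estimate page importance.

    links: dict mapping each page to the list of pages it links to
    (a hypothetical toy format, not any real crawler's output).
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with equal importance for every page.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}

        # Each page passes a share of its rank along every outgoing link.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share

        # Stop once the ranks barely change between iterations.
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break

    return rank


# Hypothetical four-page web: "c" is linked to by every other page.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

Every pass has to follow the links from each page to its neighbours, which is the access pattern a graph-native engine is designed around.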
When I joined Twitter, I moved from San Diego to the San Francisco Bay Area, to have more exposure to talent and venture capitalists. That was around 2011, before its IPO. I joined Twitter when it had 600 people. By the time I left, it had a few thousand employees. At that time, Twitter was big on the interest graph, which highlights who people should follow on the platform, and what topics they would be interested in. But Twitter, like any other internet company, didn’t care about other enterprise uses like banking. They focused on solving their own in-house problem. Similarly, Google and Facebook had their own graph solutions, but those only worked for their own problems. They were not general-purpose tools.
For example, around 2012, Google was doing Knowledge Graph. If you search for a movie actor, you will be shown other similar actors, or the family members of this actor. They do everything in-house; their infrastructure is in-house. Google can use graph processing, but they do not care about retailers and banks. They’re an internet company; they’re in the advertising business and not in the database business. So the engineers are problem solvers who want to solve their own internal problem.
Could you take us through the growth of TigerGraph? How did you start? How did the whole commercialisation of your graph database concept come along?
It’s kind of a cliche; I started in my garage with friends I know.
Building a database is challenging. Building a distributed graph database is even more challenging, because building any distributed database is tough. It requires network optimisation, and even more so for graphs. Why? Because graphs are supposed to connect dots, and sometimes they connect random dots to reveal patterns. That makes it super challenging to do distributed graph processing.
That’s why in the first five years or so, TigerGraph was in stealth mode, because we were building our product. We didn’t need salespeople, we didn’t need marketing, because the product was not ready. I have an engineering background, and I don’t want to promote something that is not ready, especially in the database business. In the long term, the product will win.
In the first five years, we built everything from the ground up, basically, from scratch. We used C++, which is really the best choice for doing databases, because you control how you use your memory, and how you return the memory to the machines if you don’t need it anymore.
But it’s harder to hire talent because most people use Python and Java, and it’s easier to learn Java and Python than to code in lower-level languages like C++. We made the difficult choice, but it’s the right choice.
Over the last four or five years, we just worked on our product and worked with a couple of bigger customers. But in order to make sure this can go mainstream, we had to make sure we have a really easy-to-use, high-level query language. When we got all this ready, we felt we were ready. In late 2017, we announced the product’s general availability to the whole world.
We also changed the name from GraphSQL to TigerGraph for branding. The original name was meant to be more like a database. We also wanted to be controversial initially, because at the time, people loved Hadoop and NoSQL. I wanted to be different, and to stand out.
Gartner predicts that 80% of data and analytics innovations will be made using graph technology by 2025, versus just 10% in 2021. Why is it that graph databases are only coming out now?
Graphs, graph concepts, graph theory, and graph databases are not new. TigerGraph is not the first one to do this. If you look at the smartphone’s history, you can make a similar comparison. Before the iPhone, we had BlackBerry and Nokia phones, which a lot of people loved, but most thought it was a niche market. When the iPhone came out, it was so powerful that it could run all kinds of applications like social media and stock trading. That’s when everybody realised that the smartphone is not a niche market anymore, it’s for everyone. Now, everybody needs an Android phone or iPhone, and they’re mainstream.
That’s what it’s like for graph databases. Before TigerGraph, you had legacy databases, the first-generation and second-generation graph databases, which a lot of people loved. They tried to use them in prototypes and on small data sets, but they were not scalable at the enterprise level, because these were mostly built for single-machine architectures and not for parallel processing.
That’s why a lot of people said, ‘Okay, graph is a nice concept, but it’s a niche market.’ If you really wanted to use graph databases, you had to do it in-house and code for a particular problem, or you just hacked around it with a relational database, a key-value store, or a MongoDB type of document database.
You mentioned that TigerGraph was not the first in the race. There are also competitors like Neo4j and Amazon Neptune. How do you outstrip them in terms of innovation?
Amazon actually did a great job of educating the market about graph databases. The problem is that Amazon bought their product. They purchased a 10-person company about four years ago to integrate the graph database into their system, but its performance is not good. Their graph database has a single-machine architecture, it’s not distributed, and it only works on Amazon.
TigerGraph works on-premises, on Amazon, Google GCP, and any other cloud. We have the advantage of being more flexible. Of course, we’re distributed. And even on a single machine, we’re more powerful. In terms of return on investment or total cost of ownership, TigerGraph is just ahead of Amazon Neptune.
We also have a lot of customers who have a deep relationship with Amazon. The CIO and CTO requirement is that before you try anything else, you have to try Amazon first. If you want to use other companies’ services, you have to clearly justify why. We have a customer that switched from Neptune to TigerGraph, and the performance difference is just so dramatic. Otherwise, they would not be able to use non-Amazon services.
The second part of this exclusive interview with TigerGraph’s CEO can be read here.