How (and why) Kafka was created at LinkedIn

Image courtesy of Ross Sokolovski.

Behind every great technology lies an almost unbelievable origin story, and Kafka, the open-source distributed event streaming platform embraced by 80% of the Fortune 100, emerged from within the labs of a company that people rely on for job hunting. 

Developed in 2010 by Jay Kreps, Neha Narkhede, and Jun Rao while they were working at LinkedIn, Kafka was released as an open-source platform in early 2011.

In 2014, the three co-creators of Kafka came together to establish Confluent, a cloud-native platform with Kafka at its core, aiming to solve the woes organisations face in managing Kafka deployments.

In this first instalment of an exclusive, two-part interview with Frontier Enterprise, Jun Rao, co-founder of Confluent, details his personal journey before, during, and after the creation of Kafka.

Could you share how you developed Kafka during your time at LinkedIn, and perhaps a little about the period before that?

I started my career at IBM; that was the first job I got right after graduate school. The group I joined at IBM was a database group. Initially, I did some core database work, particularly with IBM’s database product, Db2, for about five years. Around 2006, I got exposed to a lot of open-source technologies. It was during that time that Google and Amazon disclosed a considerable amount of their internal technology for dealing with big data. This was when MapReduce and Bigtable were made public, and they garnered significant attention from the open-source community.

Jun Rao, co-founder of Confluent. Image courtesy of Confluent.

Earlier, those technologies were only available inside those tech companies, but through open sourcing, their implementations gained popularity within the broader community. Now anyone outside those tech companies can readily access and utilise them. As part of a research group at IBM, I had the opportunity to explore and understand this newer type of technology, which sparked my involvement in some open-source initiatives.

One of the key differences between these emerging technologies and IBM’s database lies in their approach to performance and throughput. In the early stages at IBM, our focus predominantly revolved around fine-tuning processes within a single machine, striving for incremental 5% improvements. However, many of the technology companies behind these innovations approached data differently, building their systems as distributed systems: they designed for a single machine first, then ran many machines in parallel to scale out. This mindset makes achieving high throughput relatively straightforward. There’s no need to continually squeeze out an additional 5% while the machine’s temperature rises. I find this fundamental difference in system design quite interesting.

Around 2010, that’s when I was looking for another opportunity outside IBM, which eventually led me to join LinkedIn. Upon joining LinkedIn, my initial project involved working on Kafka. At that time, Jay (Kreps) had conceived the initial idea for Kafka and had made some preliminary progress in its implementation. However, it had not yet been released for production. Neha (Narkhede) and I collaborated to finish the first version of Kafka’s implementation, which was subsequently deployed within LinkedIn.

What was Kafka initially used for?

LinkedIn designed Kafka to address persistent data challenges. Around that time, LinkedIn witnessed tremendous growth in the volume of digitised information it accumulated. Initially, this data comprised traditional transactional records such as educational and job histories, as well as user connections, which made up the transactional portion of the data within LinkedIn. However, over time, LinkedIn realised that these transactional records constituted only a small percentage of its overall data. User behaviour data played a crucial role as well. For instance, even if a user didn’t modify their profile, actions such as viewing someone else’s profile or entering specific keywords reflected intentional behaviour and indicated the user’s interests. This information proved invaluable for understanding user needs from LinkedIn’s perspective. There was a substantial amount of such data to consider, stored across various locations.

Furthermore, the number of platforms and systems requiring access to this digitised information grew significantly. LinkedIn had embraced newer data services, including next-generation data warehouses like Hadoop systems, Core Data Lake, and various search engines. Additionally, LinkedIn had developed a multitude of data-driven applications and microservices during that time, numbering around 300 to 400.

How did Kafka solve LinkedIn’s problems?

LinkedIn realised that merely storing digitised information doesn’t really generate value on the platform. The data only generates value when you leverage it. Thus, the faster the flow of information to the relevant destinations, the quicker LinkedIn can derive new insights from it. To achieve this, it was essential to bring all the data together and deliver it to the places where it could be used effectively. However, upon examining the traditional data infrastructure commonly employed by most organisations at that time, LinkedIn recognised that the existing solutions did not effectively address this problem.

Around that time, two types of typical data infrastructure were commonly used:

  • The first type consisted of various databases. Many databases were primarily designed to handle data at rest, focusing on capturing and storing information. However, the usage of that data typically occurred much later, often taking days or weeks due to the limitations of the technology available at that time. 
  • The second type comprised various messaging systems, which had been around for 20 or 30 years. These systems differed in that they handled data in motion, enabling immediate action when specific events occurred. This characteristic provided the opportunity to leverage data more rapidly.

However, the biggest challenge lay in the fact that traditional messaging systems were not designed to scale. LinkedIn realised it needed a system capable of accommodating a 1,000x growth in digitised information. These messaging systems were typically designed as single-node systems, and when subjected to 100 or 1,000 times more data, they would simply break.

Because of these two mismatches, there was no existing infrastructure that adequately met LinkedIn’s needs. This prompted the birth of Kafka. We built Kafka from the ground up to address this specific problem and provide a new piece of infrastructure for handling real-time data at scale.

Jay, Neha, and I spent about a year building the first version of Kafka. It was crucial for the initial version to handle the exponential growth of digitised information, which necessitated designing it as a distributed system from the outset. As new events arrived, downstream systems could subscribe to Kafka and leverage the data immediately. This approach completely changed LinkedIn’s data architecture.
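
As an illustration of the publish/subscribe pattern Rao describes, here is a minimal sketch using today's Kafka Java client rather than the original 2010 code. The topic name page-views, the consumer group recommendations-service, and the broker address localhost:9092 are hypothetical placeholders, not details from the interview.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class PageViewPipeline {

    // A producer publishes a user-behaviour event (e.g. "user X viewed profile Y") to a topic.
    static void publishPageView(String userId, String viewedProfileId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by user keeps each user's events ordered within a single partition.
            producer.send(new ProducerRecord<>("page-views", userId, viewedProfileId));
        }
    }

    // A downstream service subscribes to the same topic and reacts as events arrive.
    static void consumePageViews() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "recommendations-service"); // each downstream system uses its own group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("page-views"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("user %s viewed profile %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```

Because events are written to a durable, partitioned log and each consumer group tracks its own read position, any number of downstream systems can subscribe to the same stream independently and process it at their own pace.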

With Kafka serving as the central hub for integrating digitised information from various sources in real time, it became the foundation for feeding downstream use cases, including data services and numerous microservices. These components continually monitored LinkedIn’s activities, enabling real-time responses. 

With these advancements, we essentially gave every developer at LinkedIn the opportunity to actively respond to real-time business events. As a result, we were able to build a lot of products that are much more engaging to the user, because the information is much fresher. We also unlocked many new opportunities that were previously unattainable; without real-time capabilities, they would simply have been missed.

The second part of this exclusive interview with Confluent co-founder Jun Rao can be read here.