Stream Processing: The Future of the Modern Data Stack

Real-Time Data Revolution

Today marks roughly ten years of building infrastructure with Apache Kafka and its ecosystem.

In 2014, I began using the Apache distribution of Kafka at SAP Ariba while building cloud infrastructure as the organization transitioned from a service-oriented architecture (SOA) to a microservices-based architecture. Soon we were deploying Confluent's Kafka distribution to production and onboarding teams to use Kafka as the central nervous system for those microservices. We made the switch because Confluent's distribution included Schema Registry and several other proprietary features that made running Kafka in production easier.

Having watched Kafka adoption rise both inside and outside the company, I joined Confluent in 2017, where I built developer productivity and platform infrastructure for Kafka, Connect, KSQL, Cloud, and the rest of the engineering organization. There I saw firsthand how quickly the broader Kafka ecosystem was being adopted.

The Evolution of Database Models from 1960 to 2010

In the 1960s, network and hierarchical database models rapidly gained popularity among organizations; examples include CODASYL, IBM's IMS, and SABRE.

In 1970, E.F. Codd's seminal paper introduced the relational database, revolutionizing database thinking by separating the database schema—the logical organization—from physical data storage. This innovation set the standard for modern database systems.

Between 1974 and 1977, two significant prototypes of relational database systems emerged: Ingres, developed at UC Berkeley, and System R, created at IBM San Jose. Ingres employed the QUEL query language and paved the way for systems such as Ingres Corp., MS SQL Server, Sybase, Wang's PACE, and Britton-Lee.

System R introduced the SEQUEL query language, shaping the development of SQL/DS, DB2, Allbase, Oracle, and Non-Stop SQL. This era also marked the recognition and adoption of the term Relational Database Management System (RDBMS).

In 1976, Peter Chen proposed a new database model called the Entity-Relationship (ER) model. It let designers focus on how data would be applied rather than on the logical table structure.

Structured Query Language, known as SQL, was adopted as the standard query language by the American National Standards Institute in 1986 and by the International Organization for Standardization in 1987.

Since then, RDBMS popularity has risen steadily at the expense of the existing network and hierarchical data management tools. In subsequent years a multitude of products emerged, including DB2, PARADOX, R:BASE 5000, RIM, dBASE III and IV, OS/2 Database Manager, and Watcom SQL.

The 1990s played a crucial role in advancing databases and database software, particularly for application development. New client development tools such as Oracle Developer, PowerBuilder, and Visual Basic (VB) were released, and open-source databases such as MySQL appeared. Tools that boosted developer productivity, such as ODBC, Excel, and Access, were also developed, and prototypes of Object Database Management Systems (ODBMS) emerged early in the decade. By 2010 there was a notable shift in how data was generated and used, both within organizations and externally, with row-oriented and column-oriented databases finding market acceptance alongside NoSQL solutions.

Modern Data Stack

From 2010 to the present day, the modern data stack has grown into a comprehensive ecosystem of products for data management and analytics. It typically includes cloud storage such as Amazon S3 or Google Cloud Storage for scalable data storage, data pipelines orchestrated with tools such as Apache Kafka or Apache Airflow for smooth ingestion and processing, and platforms such as Snowflake or Google BigQuery for scalable, high-performance data warehousing and analytics.
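
To make the ingestion side of that stack concrete, here is a minimal sketch that publishes one event to a Kafka topic using the kafka-python client. The broker address, topic name, and event fields are placeholders for illustration, not a prescription.

```python
# Minimal ingestion sketch with the kafka-python client.
# Assumes a broker is reachable at localhost:9092 and that an "orders" topic exists;
# both are placeholders for this example.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                          # placeholder broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),    # serialize dicts as JSON bytes
)

# Publish a single event; downstream consumers (stream processors, warehouse loaders) pick it up later.
producer.send("orders", {"order_id": 1, "amount": 42.5})
producer.flush()  # block until the broker has acknowledged the event
```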

The stack is agile and scalable, accommodating diverse data sources and formats, including structured, semi-structured, and unstructured data. Real-time analytics capabilities are enhanced through technologies like Apache Flink or Apache Spark Streaming. Overall, the modern data stack empowers organizations to extract actionable insights, drive innovation, and maintain competitive advantage in today's data-driven world.
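
As a small sketch of that real-time layer, the snippet below uses Spark Structured Streaming, one of the engines mentioned above, with its built-in rate source so it runs without any external system. The window size, row rate, and application name are arbitrary choices for the example.

```python
# Minimal Spark Structured Streaming sketch: count events per 10-second tumbling window.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The built-in "rate" source emits synthetic (timestamp, value) rows; in a real pipeline
# this would be a Kafka topic or another streaming source.
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load())

# Aggregate events into 10-second tumbling windows as they arrive.
counts = events.groupBy(window(col("timestamp"), "10 seconds")).count()

query = (counts.writeStream
         .outputMode("complete")   # re-emit the full aggregated table on each trigger
         .format("console")
         .start())
query.awaitTermination()
```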

Products in this category continue to proliferate, with solutions like ClickHouse, Atlan, Monte Carlo, Telmai, Airbyte, Rockset, Materialize, and numerous others gaining significant popularity.

Revolutionizing Data Efficiency: Stream Processing and Streaming Stores

Stream processing holds the promise of handling data efficiently, with low latency and low compute cost, even at massive volumes and high cardinality. With the aid of streaming stores, stream processors avoid extra computation at ingestion time and can operate directly on the source data across multiple streams.
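
The sketch below is a plain-Python illustration of that incremental style of computation, not any particular engine's API: per-key sums are updated in place as each event arrives, instead of re-scanning the full history on every query. The Event fields and window size are invented for the example.

```python
# Conceptual sketch of incremental stream processing in plain Python.
# A real deployment would consume from a streaming store (Kafka, Redpanda, etc.);
# the Event type, keys, and window size below are illustrative only.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Event:
    key: str      # e.g. a sensor or customer id
    value: float
    ts: int       # event time in seconds

def tumbling_sum(events, window_seconds=60):
    """Update per-key, per-window sums as each event arrives,
    rather than recomputing over the whole history."""
    state = defaultdict(float)
    for e in events:  # treat the iterable as an unbounded stream
        window_start = e.ts - (e.ts % window_seconds)
        state[(e.key, window_start)] += e.value
        yield e.key, window_start, state[(e.key, window_start)]

# Tiny in-memory stand-in for a live stream.
stream = [Event("sensor-1", 2.0, 100), Event("sensor-1", 3.5, 130), Event("sensor-2", 1.0, 150)]
for key, win, total in tumbling_sum(stream):
    print(key, win, total)
```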

In addition to Kafka, emerging streaming stores like Redpanda and WarpStream are gaining traction. Today, companies such as DeltaStream, RisingWave, Decodable, and Arroyo are building products aimed at simplifying the complexity of building stream processing applications. Each new evolution of a software stack commoditizes the stack that came before it.

Because stream processing improves efficiency by handling data closer to its source, with low latency and at high cardinality, I expect it to take over many of the use cases currently served by the modern data stack by the end of this decade.

The adoption of AI/ML across organizations will drive this transformation. AI/ML applications demand real-time data processing for timely decisions and predictions, and stream processing handles continuous data streams with low latency, which is crucial in dynamic environments such as financial trading, IoT networks, and real-time recommendations.
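
As one hedged illustration of the kind of feature such applications compute on the fly, the sketch below maintains a per-user rolling average over the most recent events in plain Python; the event shape, window length, and user ids are made up for the example.

```python
# Illustrative only: maintain a rolling average of a user's last N transaction amounts,
# the kind of low-latency feature a fraud or recommendation model might consume.
from collections import defaultdict, deque

WINDOW = 5  # keep the last 5 amounts per user (arbitrary choice)
recent = defaultdict(lambda: deque(maxlen=WINDOW))

def update_feature(user_id: str, amount: float) -> float:
    """Called once per incoming event; returns the refreshed rolling average."""
    recent[user_id].append(amount)
    return sum(recent[user_id]) / len(recent[user_id])

# Feed a few events as a stand-in for a live stream.
for user, amount in [("u1", 20.0), ("u1", 40.0), ("u2", 5.0), ("u1", 90.0)]:
    print(user, update_feature(user, amount))
```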

In summary, stream processing is indispensable for AI/ML applications to achieve real-time analytics, adaptive decision-making, and operational efficiency across various industries.

“Data streaming is the central nervous system for data, while AI/ML algorithms are the brain.” ~ Confluent