What is Apache Kafka?

What is data streaming?

Data streaming is the continuous flow of data from one or more sources, typically transferred at high speed in real time or near real time. Streaming data comes from sources such as IoT devices, social media feeds, web activity logs, financial transactions, and news feeds. The goal of stream processing is to act on that data with low latency.

Apache Kafka is an open-source data streaming platform. It was originally developed at LinkedIn and open-sourced in 2011. Its goal is to handle a constant load of streaming data and to process that data sequentially and incrementally.

Apache Kafka works on a publish-subscribe model. It stores streaming data durably and in order, and lets applications process it in real time. Its primary uses are real-time data pipelines and streaming applications. It provides a fault-tolerant, high-throughput, scalable messaging system.

Even though Apache Kafka is a messaging system, it differs from traditional messaging systems in several ways:

Kafka combines the messaging-queue and publish-subscribe models by using a partitioned log model.
It has policy-based message retention, and users can configure the retention window. In a traditional messaging system, a message is deleted once it has been consumed.
Kafka spreads the partitions of a topic across different servers, which allows it to scale.
Because of the partitioned log architecture, consumers receive the records of each partition in the order they were written.
Multiple consumers can subscribe to the same topic in Kafka. In a traditional system, a message is removed after one consumer has consumed it, which makes it unavailable to other consumers.
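In Kafka, replication is built in: the partitions of each topic are replicated across brokers according to the topic's configured replication factor. In a traditional system, replication usually has to be set up separately.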
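
The difference between consume-and-delete queues and retention-based logs can be illustrated with a small in-memory sketch. It does not use a Kafka client; the class names `DestructiveQueue` and `RetainedLog`, the timestamps, and the 60-second window are assumptions made for illustration. Real Kafka also applies retention to whole log segments rather than to individual records, so treat this as a simplified model of the idea only.

```python
from collections import deque


class DestructiveQueue:
    """Traditional queue: a message disappears once any consumer receives it."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # popleft() removes the message, so no other consumer can see it.
        return self._messages.popleft() if self._messages else None


class RetainedLog:
    """Kafka-style log: reads never delete; only the retention window does."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._records = []  # list of (timestamp, value), oldest first

    def append(self, timestamp, value):
        self._records.append((timestamp, value))

    def read_all(self):
        return [value for _, value in self._records]

    def enforce_retention(self, now):
        # Drop records older than the retention window, consumed or not.
        cutoff = now - self.retention_seconds
        self._records = [(t, v) for t, v in self._records if t >= cutoff]


queue = DestructiveQueue()
log = RetainedLog(retention_seconds=60)
for timestamp, event in [(0, "signup"), (50, "login")]:
    queue.send(event)
    log.append(timestamp, event)

# Two queue consumers split the messages; neither sees both.
consumer_a_got = queue.receive()
consumer_b_got = queue.receive()

# Two log consumers each see every retained record.
billing_view = log.read_all()
audit_view = log.read_all()

# At t=100 the record written at t=0 falls outside the 60-second window.
log.enforce_retention(now=100)
after_retention = log.read_all()
```

In the queue, each message reaches exactly one consumer. In the log, both readers see both records, and a record is removed only when it ages out of the retention window.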

How does Apache Kafka work?

The diagram above shows the workflow of a data streaming system that uses Apache Kafka. Kafka blends the queuing and publish-subscribe messaging models to offer the strengths of both.

Queuing lets many consumers share the load, which makes it great for scaling, but it does not support multiple subscribers.
The publish-subscribe model supports multiple subscribers, but it cannot distribute work among them.

Kafka solves this by using a partitioned log model. A log is an ordered sequence of records. The log of each topic is split into partitions, and the partitions are divided among the consumers in a group, so that each consumer in the group reads from different partitions of the same topic. Separate groups each receive the full topic. This provides both scalability and multi-subscriber delivery.
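
A minimal sketch of this idea, assuming a topic modeled as a plain dictionary of partitions and a simple round-robin assignment, is shown below. The function names `assign_partitions` and `read_group` and the consumer names are hypothetical, and Kafka itself offers several assignment strategies, so this illustrates the principle rather than Kafka's actual algorithm.

```python
def assign_partitions(num_partitions, consumers):
    """Give each partition to exactly one consumer of the group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def read_group(topic, consumers):
    """Return the records each consumer of one group would read."""
    assignment = assign_partitions(len(topic), consumers)
    return {
        consumer: [record for p in partitions for record in topic[p]]
        for consumer, partitions in assignment.items()
    }


# A topic with three partitions; each partition is an ordered list of records.
topic = {
    0: ["a0", "a1"],
    1: ["b0"],
    2: ["c0", "c1", "c2"],
}

# One group with two consumers splits the partitions (queue-like scaling).
analytics = read_group(topic, ["analytics-1", "analytics-2"])

# A second, independent group still receives every record (publish-subscribe).
alerts = read_group(topic, ["alerts-1"])
```

Here `analytics-1` reads partitions 0 and 2 while `analytics-2` reads partition 1, so the work is shared. The single consumer of the `alerts` group reads all three partitions, so both groups see the complete topic.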

Kafka also allows consumers to re-read and reprocess past messages from a topic, so different applications can read the same data independently and at their own pace.
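
Re-reading works because each reader tracks its own position (offset) in the log. The sketch below models that with a hypothetical `LogReader` class over an in-memory list. The Kafka consumer API has a comparable seek operation, but nothing here calls a real client.

```python
class LogReader:
    """A reader that keeps its own offset into a shared, append-only log."""

    def __init__(self, records):
        self._records = records
        self.position = 0  # offset of the next record to read

    def poll(self, max_records=10):
        batch = self._records[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def seek(self, offset):
        # Move the read position; the log itself is not modified.
        self.position = offset


records = ["order-1", "order-2", "order-3", "order-4"]
fast = LogReader(records)
slow = LogReader(records)

first_pass = fast.poll()               # the fast reader consumes everything
slow_batch = slow.poll(max_records=1)  # the slow reader is still at the start

fast.seek(1)                           # rewind to offset 1 and reprocess
replayed = fast.poll()
```

The two readers never interfere with each other. Rewinding `fast` to offset 1 returns `order-2` through `order-4` again, while `slow` stays at offset 1 until it next polls.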

Advantages of Apache Kafka:

High Throughput: Kafka can handle millions of records per second, which makes it ideal for high-volume, fast-moving environments.
Real-time Processing: Kafka delivers messages with low latency, which makes it suitable for real-time analytics and monitoring.
Scalability: Kafka scales horizontally, meaning that brokers, partitions, and consumers can be added as the workload grows.
Fault Tolerance: Kafka replicates data, so even if a server or node fails, the data remains available for processing. This ensures high availability and durability.
Replayability: Kafka allows consumers to re-read or reprocess messages at any time within the retention period, because it stores data according to the retention policy.
Message Order: Messages that share a partition key are written to the same partition and are read in the order they were written. This is useful for applications such as financial transactions and logs.
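
The message-order guarantee follows from routing every record with the same key to the same partition. The sketch below illustrates this with a stable hash from Python's standard library. Using `zlib.crc32` is an assumption made for the example (Kafka's default partitioner uses a different hash function), and the account keys and events are invented.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (a stand-in for Kafka's partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(events, num_partitions):
    """Append each (key, value) event to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


NUM_PARTITIONS = 3
events = [
    ("acct-1", "deposit 100"),
    ("acct-2", "deposit 50"),
    ("acct-1", "withdraw 30"),
    ("acct-2", "withdraw 10"),
    ("acct-1", "close"),
]
partitions = produce(events, NUM_PARTITIONS)


def events_for(key):
    """Read one key's events back from the single partition that holds them."""
    partition = partitions[partition_for(key, NUM_PARTITIONS)]
    return [value for k, value in partition if k == key]


acct_1_history = events_for("acct-1")
acct_2_history = events_for("acct-2")
```

Whatever partition each key hashes to, all events for one account sit in one partition in the order they were produced. No ordering is implied between events of different accounts that land in different partitions.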

Disadvantages of Apache Kafka:

High Resource Utilization: Under heavy load and with long retention periods, Kafka uses significant resources such as memory and CPU.
Limitation on Message Ordering: Kafka guarantees message ordering only within a partition. Ordering across all partitions of a topic is complex to achieve.
Complex Setup: Setting up Kafka components such as brokers, partitions, and replication can be complex and needs to be done carefully.
Message Transformation: Kafka brokers do not transform messages themselves. Transformation depends on additional tools such as Kafka Streams, Kafka Connect, or an external stream processor.
Security Configuration: Kafka supports security features such as encryption and authentication, but configuring them can be complex.
Storage Cost: Kafka retains messages according to a retention policy, which can lead to high storage costs.
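
To get a feel for the storage cost, a rough back-of-the-envelope estimate multiplies the write rate, the average message size, the retention window, and the replication factor. The sketch below does that arithmetic. All of the numbers are hypothetical, and the formula ignores compression, index files, and other overhead, so it is a first approximation only.

```python
def retained_bytes(messages_per_second, avg_message_bytes,
                   retention_seconds, replication_factor):
    """Approximate bytes on disk across the cluster once retention is full."""
    return (messages_per_second * avg_message_bytes
            * retention_seconds * replication_factor)


SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical workload: 10,000 messages/s of 1,000 bytes each,
# kept for 7 days with 3 copies of every partition.
estimate = retained_bytes(
    messages_per_second=10_000,
    avg_message_bytes=1_000,
    retention_seconds=7 * SECONDS_PER_DAY,
    replication_factor=3,
)
estimate_tb = estimate / 1e12  # decimal terabytes
```

Under these assumptions the cluster holds about 18 TB at steady state. Halving the retention window or the replication factor halves the figure, which is why both settings are the main levers for storage cost.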

Use Cases:

Real-time Data Streaming: This is the most common use case of Kafka, since it is ideal for ingesting and processing data in real time. It was also Kafka's original use case: LinkedIn uses it for user activity tracking, real-time feeds, and similar workloads. Other examples are live market data for stock trading platforms and supply chain and logistics data used to optimize routes and track shipments.
Real-time Data Analytics: Kafka streams live data to analytics engines, which helps organizations make fast, informed decisions.
Event-driven Architecture: Kafka supports event-driven architecture. Applications produce, consume, and process events in real time, which helps in building complex event-driven applications.
Log Aggregation: Kafka collects logs from various sources and makes them available for storage and analysis. It allows low-latency processing, supports multiple data sources, and offers good performance and durability.
Metrics: Apache Kafka can aggregate metrics from distributed applications into centralized feeds of operational data, which are used for operational monitoring.
Fraud Detection: Kafka plays a key role in real-time fraud detection by continuously streaming financial transaction data, such as credit card purchases and fund transfers, to machine learning models. These models detect unusual patterns or anomalies that indicate fraudulent behavior.
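
As an illustration of the fraud detection use case, the sketch below processes a stream of transaction amounts one at a time and flags values that are far from the running average. A simple statistical rule (a running mean and standard deviation computed with Welford's online algorithm) stands in for the machine learning model, and a plain list stands in for the Kafka topic. The threshold, warm-up length, and amounts are all made-up values.

```python
import math


class RunningStats:
    """Welford's online algorithm: mean and variance without storing the data."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    def stddev(self):
        # Sample standard deviation; 0.0 until there are at least two values.
        return math.sqrt(self._m2 / (self.count - 1)) if self.count > 1 else 0.0


def flag_anomalies(amounts, threshold=3.0, warmup=5):
    """Flag amounts more than `threshold` standard deviations from the mean so far."""
    stats = RunningStats()
    flagged = []
    for amount in amounts:
        std = stats.stddev()
        is_outlier = (
            stats.count >= warmup
            and std > 0
            and abs(amount - stats.mean) > threshold * std
        )
        if is_outlier:
            flagged.append(amount)  # keep outliers out of the baseline
        else:
            stats.update(amount)
    return flagged


# Hypothetical stream of card purchases, in arrival order.
transactions = [20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 950.0, 23.0]
suspicious = flag_anomalies(transactions)
```

Each amount is examined once, as it arrives, which mirrors how a consumer would handle records from a topic. With these inputs only the 950.0 purchase is flagged.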