Introduction

Real-time data ingestion is a fundamental requirement for modern analytics platforms. Applications continuously generate events such as transactions, logs, user activity, and IoT sensor data that need to be captured and analyzed with minimal delay.

Apache Kafka is one of the most widely used distributed event streaming platforms, while ClickHouse® is a high-performance analytical database built for real-time analytics. Together, they form a powerful combination that enables applications to stream data into ClickHouse® within seconds.

In this article, we'll build a Kafka-to-ClickHouse® ingestion pipeline using ClickHouse®'s built-in Kafka Engine and Materialized Views, complete with practical SQL examples.

What is Apache Kafka?

Apache Kafka is an open-source distributed event streaming platform designed for high-throughput, fault-tolerant data pipelines. Applications publish events to Kafka topics, and consumers read those events independently.

Some of the core Kafka concepts include:

Topic – A named stream of events (for example: orders, logs, or page_views)
Producer – An application that publishes events to a Kafka topic
Consumer – An application that reads events from a topic
Partition – A topic can be divided into multiple partitions to enable parallel processing and scalability

Kafka acts as a durable event buffer between applications and downstream systems.

Why Use Kafka with ClickHouse®?

Kafka and ClickHouse® complement each other exceptionally well in modern real-time analytics architectures.

Kafka handles reliable event ingestion, while ClickHouse® provides extremely fast analytical queries over large datasets.

This combination is commonly used for:

Real-time dashboards
Log analytics
Application monitoring
User behavior analytics
Clickstream processing
Fraud detection
IoT telemetry pipelines

Instead of relying on scheduled ETL jobs, events flow continuously from applications into Kafka and are automatically ingested into ClickHouse®.

Architecture Overview

A typical ingestion pipeline looks like this:

Application (Producer)
        │
        ▼
    Kafka Topic
        │
        ▼
ClickHouse Kafka Engine Table
        │
        ▼
 Materialized View
        │
        ▼
MergeTree Storage Table
        │
        ▼
Analytics & Dashboards

The pipeline consists of three main ClickHouse® components:

Kafka Engine Table – Connects ClickHouse® to a Kafka topic and consumes incoming messages.
Materialized View – Automatically transfers data from the Kafka engine table into a storage table.
MergeTree Table – Stores the ingested data permanently for fast analytical queries.

Once configured, data flows automatically without requiring manual ETL jobs.

Prerequisites

Before starting, ensure you have:

A running ClickHouse® instance
A running Apache Kafka cluster (local or cloud)
A Kafka topic receiving messages
Basic understanding of SQL and Kafka
Kafka Engine support enabled in ClickHouse® (enabled by default)

Step 1: Verify Kafka is Running

First, verify that your Kafka broker is available.

Check broker status:

kafka-broker-api-versions.sh \
--bootstrap-server localhost:9092

List available topics:

kafka-topics.sh \
--list \
--bootstrap-server localhost:9092

Create a topic if one doesn't already exist:

kafka-topics.sh \
--create \
--topic orders \
--bootstrap-server localhost:9092 \
--partitions 3 \
--replication-factor 1

Step 2: Produce Sample Messages

For this example, we'll create an orders topic containing JSON events.

Start the Kafka console producer:

kafka-console-producer.sh \
--topic orders \
--bootstrap-server localhost:9092

Publish the following sample messages:

{"order_id":1001,"customer_id":101,"country":"IN","status":"completed","amount":1200.00,"order_date":"2024-01-05 10:00:00"}

{"order_id":1002,"customer_id":102,"country":"US","status":"pending","amount":450.00,"order_date":"2024-01-06 11:00:00"}

Step 3: Create the Storage Table

Create the MergeTree table that will permanently store incoming events.

CREATE TABLE default.orders
(
    order_id UInt32,
    customer_id UInt32,
    country LowCardinality(String),
    status LowCardinality(String),
    amount Float64,
    order_date DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(order_date)
ORDER BY (country, status, order_date);

This is the table your dashboards and analytical queries will use.

Step 4: Create the Kafka Engine Table

Next, create a Kafka Engine table that connects directly to the Kafka topic.

CREATE TABLE default.orders_kafka
(
    order_id UInt32,
    customer_id UInt32,
    country String,
    status String,
    amount Float64,
    order_date DateTime
)
ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'localhost:9092',
    kafka_topic_list = 'orders',
    kafka_group_name = 'clickhouse_orders_consumer',
    kafka_format = 'JSONEachRow',
    kafka_num_consumers = 1;

Kafka Engine Settings

Setting	Description
`kafka_broker_list`	Kafka broker address
`kafka_topic_list`	Kafka topic to consume
`kafka_group_name`	Consumer group identifier
`kafka_format`	Format of incoming messages
`kafka_num_consumers`	Number of parallel consumers

Note: Kafka Engine tables are designed for streaming ingestion only. They should not be queried directly for analytics.

Step 5: Create the Materialized View

The Materialized View continuously reads messages from the Kafka Engine table and inserts them into the MergeTree storage table.

CREATE MATERIALIZED VIEW default.orders_mv
TO default.orders
AS
SELECT
    order_id,
    customer_id,
    country,
    status,
    amount,
    order_date
FROM default.orders_kafka;

Once the Materialized View is created, the ingestion pipeline becomes active.

Every new message published to the orders topic will automatically appear in the storage table.

Step 6: Verify Data Ingestion

Query the storage table to verify that data has been ingested successfully.

SELECT *
FROM default.orders
ORDER BY order_date
LIMIT 10;

Expected output:

order_id	customer_id	country	status	amount	order_date
1001	101	IN	completed	1200.00	2024-01-05 10:00:00
1002	102	US	pending	450.00	2024-01-06 11:00:00

Count the total rows:

SELECT count()
FROM default.orders;

Monitor Kafka consumer status:

SELECT *
FROM system.kafka_consumers
WHERE database = 'default'
AND table = 'orders_kafka';

Step 7: Query the Data

Once the pipeline is active, the storage table behaves like any other ClickHouse® table.

For example:

SELECT
    country,
    count() AS total_orders,
    sum(amount) AS revenue
FROM default.orders
GROUP BY country
ORDER BY revenue DESC;

This enables dashboards, reports, and real-time analytics over continuously arriving data.

Supported Kafka Message Formats

ClickHouse® supports multiple Kafka message formats through the kafka_format setting.

Some commonly used formats include:

JSONEachRow
JSONCompact
CSV
TabSeparated
Avro
Protobuf
Parquet

Among these, JSONEachRow is the most widely used because it is easy to read, easy to produce, and supported by most Kafka producers.

Scaling Kafka Consumers

As event throughput increases, multiple Kafka consumers can ingest data in parallel.

For example:

kafka_num_consumers = 3

A good practice is to match the number of Kafka consumers with the number of topic partitions.

For example:

3 partitions → 3 consumers
6 partitions → 6 consumers

Adding more consumers than partitions will not improve throughput because the additional consumers remain idle.

Quick Reference

Component	Purpose
Kafka Topic	Source of streaming events
Kafka Engine Table	Reads events from Kafka
Materialized View	Automatically inserts data into storage
MergeTree Table	Permanent analytical storage
`system.kafka_consumers`	Monitor consumer status and lag

Final Thoughts

Integrating Kafka with ClickHouse® provides a simple yet powerful solution for building real-time analytics pipelines.

By combining the Kafka Engine, Materialized Views, and MergeTree tables, ClickHouse® can continuously consume streaming events and make them immediately available for analytical queries.

Whether you're building operational dashboards, monitoring systems, clickstream analytics, or event-driven data warehouses, this architecture offers a scalable and production-ready foundation with minimal operational overhead.

As your event volume grows, you can scale Kafka partitions and ClickHouse® consumers independently while maintaining fast ingestion and high-performance analytical queries.