Introduction
Real-time data ingestion is a fundamental requirement for modern analytics platforms. Applications continuously generate events such as transactions, logs, user activity, and IoT sensor data that need to be captured and analyzed with minimal delay.
Apache Kafka is one of the most widely used distributed event streaming platforms, while ClickHouse® is a high-performance analytical database built for real-time analytics. Together, they form a powerful combination that enables applications to stream data into ClickHouse® within seconds.
In this article, we'll build a Kafka-to-ClickHouse® ingestion pipeline using ClickHouse®'s built-in Kafka Engine and Materialized Views, complete with practical SQL examples.
What is Apache Kafka?
Apache Kafka is an open-source distributed event streaming platform designed for high-throughput, fault-tolerant data pipelines. Applications publish events to Kafka topics, and consumers read those events independently.
Some of the core Kafka concepts include:
-
Topic – A named stream of events (for example:
orders,logs, orpage_views) - Producer – An application that publishes events to a Kafka topic
- Consumer – An application that reads events from a topic
- Partition – A topic can be divided into multiple partitions to enable parallel processing and scalability
Kafka acts as a durable event buffer between applications and downstream systems.
Why Use Kafka with ClickHouse®?
Kafka and ClickHouse® complement each other exceptionally well in modern real-time analytics architectures.
Kafka handles reliable event ingestion, while ClickHouse® provides extremely fast analytical queries over large datasets.
This combination is commonly used for:
- Real-time dashboards
- Log analytics
- Application monitoring
- User behavior analytics
- Clickstream processing
- Fraud detection
- IoT telemetry pipelines
Instead of relying on scheduled ETL jobs, events flow continuously from applications into Kafka and are automatically ingested into ClickHouse®.
Architecture Overview
A typical ingestion pipeline looks like this:
Application (Producer)
│
▼
Kafka Topic
│
▼
ClickHouse Kafka Engine Table
│
▼
Materialized View
│
▼
MergeTree Storage Table
│
▼
Analytics & Dashboards
The pipeline consists of three main ClickHouse® components:
- Kafka Engine Table – Connects ClickHouse® to a Kafka topic and consumes incoming messages.
- Materialized View – Automatically transfers data from the Kafka engine table into a storage table.
- MergeTree Table – Stores the ingested data permanently for fast analytical queries.
Once configured, data flows automatically without requiring manual ETL jobs.
Prerequisites
Before starting, ensure you have:
- A running ClickHouse® instance
- A running Apache Kafka cluster (local or cloud)
- A Kafka topic receiving messages
- Basic understanding of SQL and Kafka
- Kafka Engine support enabled in ClickHouse® (enabled by default)
Step 1: Verify Kafka is Running
First, verify that your Kafka broker is available.
Check broker status:
kafka-broker-api-versions.sh \
--bootstrap-server localhost:9092
List available topics:
kafka-topics.sh \
--list \
--bootstrap-server localhost:9092
Create a topic if one doesn't already exist:
kafka-topics.sh \
--create \
--topic orders \
--bootstrap-server localhost:9092 \
--partitions 3 \
--replication-factor 1
Step 2: Produce Sample Messages
For this example, we'll create an orders topic containing JSON events.
Start the Kafka console producer:
kafka-console-producer.sh \
--topic orders \
--bootstrap-server localhost:9092
Publish the following sample messages:
{"order_id":1001,"customer_id":101,"country":"IN","status":"completed","amount":1200.00,"order_date":"2024-01-05 10:00:00"}
{"order_id":1002,"customer_id":102,"country":"US","status":"pending","amount":450.00,"order_date":"2024-01-06 11:00:00"}
Step 3: Create the Storage Table
Create the MergeTree table that will permanently store incoming events.
CREATE TABLE default.orders
(
order_id UInt32,
customer_id UInt32,
country LowCardinality(String),
status LowCardinality(String),
amount Float64,
order_date DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(order_date)
ORDER BY (country, status, order_date);
This is the table your dashboards and analytical queries will use.
Step 4: Create the Kafka Engine Table
Next, create a Kafka Engine table that connects directly to the Kafka topic.
CREATE TABLE default.orders_kafka
(
order_id UInt32,
customer_id UInt32,
country String,
status String,
amount Float64,
order_date DateTime
)
ENGINE = Kafka()
SETTINGS
kafka_broker_list = 'localhost:9092',
kafka_topic_list = 'orders',
kafka_group_name = 'clickhouse_orders_consumer',
kafka_format = 'JSONEachRow',
kafka_num_consumers = 1;
Kafka Engine Settings
| Setting | Description |
|---|---|
kafka_broker_list |
Kafka broker address |
kafka_topic_list |
Kafka topic to consume |
kafka_group_name |
Consumer group identifier |
kafka_format |
Format of incoming messages |
kafka_num_consumers |
Number of parallel consumers |
Note: Kafka Engine tables are designed for streaming ingestion only. They should not be queried directly for analytics.
Step 5: Create the Materialized View
The Materialized View continuously reads messages from the Kafka Engine table and inserts them into the MergeTree storage table.
CREATE MATERIALIZED VIEW default.orders_mv
TO default.orders
AS
SELECT
order_id,
customer_id,
country,
status,
amount,
order_date
FROM default.orders_kafka;
Once the Materialized View is created, the ingestion pipeline becomes active.
Every new message published to the orders topic will automatically appear in the storage table.
Step 6: Verify Data Ingestion
Query the storage table to verify that data has been ingested successfully.
SELECT *
FROM default.orders
ORDER BY order_date
LIMIT 10;
Expected output:
| order_id | customer_id | country | status | amount | order_date |
|---|---|---|---|---|---|
| 1001 | 101 | IN | completed | 1200.00 | 2024-01-05 10:00:00 |
| 1002 | 102 | US | pending | 450.00 | 2024-01-06 11:00:00 |
Count the total rows:
SELECT count()
FROM default.orders;
Monitor Kafka consumer status:
SELECT *
FROM system.kafka_consumers
WHERE database = 'default'
AND table = 'orders_kafka';
Step 7: Query the Data
Once the pipeline is active, the storage table behaves like any other ClickHouse® table.
For example:
SELECT
country,
count() AS total_orders,
sum(amount) AS revenue
FROM default.orders
GROUP BY country
ORDER BY revenue DESC;
This enables dashboards, reports, and real-time analytics over continuously arriving data.
Supported Kafka Message Formats
ClickHouse® supports multiple Kafka message formats through the kafka_format setting.
Some commonly used formats include:
- JSONEachRow
- JSONCompact
- CSV
- TabSeparated
- Avro
- Protobuf
- Parquet
Among these, JSONEachRow is the most widely used because it is easy to read, easy to produce, and supported by most Kafka producers.
Scaling Kafka Consumers
As event throughput increases, multiple Kafka consumers can ingest data in parallel.
For example:
kafka_num_consumers = 3
A good practice is to match the number of Kafka consumers with the number of topic partitions.
For example:
- 3 partitions → 3 consumers
- 6 partitions → 6 consumers
Adding more consumers than partitions will not improve throughput because the additional consumers remain idle.
Quick Reference
| Component | Purpose |
|---|---|
| Kafka Topic | Source of streaming events |
| Kafka Engine Table | Reads events from Kafka |
| Materialized View | Automatically inserts data into storage |
| MergeTree Table | Permanent analytical storage |
system.kafka_consumers |
Monitor consumer status and lag |
Final Thoughts
Integrating Kafka with ClickHouse® provides a simple yet powerful solution for building real-time analytics pipelines.
By combining the Kafka Engine, Materialized Views, and MergeTree tables, ClickHouse® can continuously consume streaming events and make them immediately available for analytical queries.
Whether you're building operational dashboards, monitoring systems, clickstream analytics, or event-driven data warehouses, this architecture offers a scalable and production-ready foundation with minimal operational overhead.
As your event volume grows, you can scale Kafka partitions and ClickHouse® consumers independently while maintaining fast ingestion and high-performance analytical queries.







