As ClickHouse® deployments grow beyond a single server, ensuring high availability, scalability, and fault tolerance becomes essential. While a standalone node can process billions of rows efficiently, production environments often require multiple servers working together as a cluster.
This is where ClickHouse Keeper comes in.
ClickHouse Keeper is the modern coordination service for ClickHouse®. It removes the need for an external Apache ZooKeeper deployment while providing the same coordination capabilities required for replication, distributed DDL, and cluster management.
In this article, we'll build a production-style ClickHouse® cluster and understand how Keeper enables reliable replication across multiple nodes.
Why Clustering?
A single server has limitations. As workloads increase, organizations typically need:
- High availability
- Data replication
- Horizontal scalability
- Fault tolerance
- Cluster-wide query execution
ClickHouse addresses these requirements by separating data storage from cluster coordination.
Keeper is a coordination service, not a database: it does not store your table data. Instead, it holds the metadata that replicas need to stay synchronized.
Keeper stores:
- Replica metadata
- Part metadata
- Leader election information
- Distributed DDL queue entries
- Replication state
Your actual table data always remains on the ClickHouse servers.
Understanding Cluster Architecture
A ClickHouse cluster is built from three primary components:
- Shards
- Replicas
- Keeper Nodes
A shard stores a subset of your data.
A replica maintains a copy of that shard to provide redundancy and high availability.
Example:
Shard 1
├── Replica A
└── Replica B
Shard 2
├── Replica A
└── Replica B
In this architecture:
- Sharding distributes data horizontally.
- Replication protects against node failures.
- Keeper coordinates replication metadata.
Example Deployment
For this guide we'll use:
Keeper Nodes
keeper1.example.com
keeper2.example.com
keeper3.example.com
ClickHouse Nodes
ch-node1.example.com
ch-node2.example.com
Cluster topology:
- 1 Shard
- 2 Replicas
- 3 Keeper Nodes
Architecture:
                     +-------------+
                     |  Keeper 1   |
                     +-------------+
                            |
                     +-------------+
                     |  Keeper 2   |
                     +-------------+
                            |
                     +-------------+
                     |  Keeper 3   |
                     +-------------+
                            ▲
                            │
             ┌──────────────┴──────────────┐
             │                             │
    +------------------+          +------------------+
    |     ch-node1     |          |     ch-node2     |
    |    Replica 1     |          |    Replica 2     |
    +------------------+          +------------------+
Configuring ClickHouse Keeper
Each Keeper node requires a unique server ID.
Example configuration for keeper1:
<clickhouse>
    <keeper_server>
        <tcp_port>9181</tcp_port>
        <server_id>1</server_id>
        <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>
        <raft_configuration>
            <server>
                <id>1</id>
                <hostname>keeper1.example.com</hostname>
                <port>9234</port>
            </server>
            <server>
                <id>2</id>
                <hostname>keeper2.example.com</hostname>
                <port>9234</port>
            </server>
            <server>
                <id>3</id>
                <hostname>keeper3.example.com</hostname>
                <port>9234</port>
            </server>
        </raft_configuration>
    </keeper_server>
</clickhouse>
For the remaining Keeper nodes, only the server_id changes.
keeper2
<server_id>2</server_id>
keeper3
<server_id>3</server_id>
Restart ClickHouse on every Keeper node after updating the configuration.
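Because the three Keeper files differ only in server_id, they can be generated from a single template. The following is a minimal Python sketch rather than an official tool: the helper name render_keeper_config and the KEEPER_HOSTS list are illustrative assumptions, and it emits only the elements shown in the example above.

```python
import xml.etree.ElementTree as ET

# Hosts and ports taken from the example in this guide.
KEEPER_HOSTS = ["keeper1.example.com", "keeper2.example.com", "keeper3.example.com"]


def render_keeper_config(server_id, hosts=KEEPER_HOSTS):
    """Build the keeper_server config for one node (hypothetical helper)."""
    root = ET.Element("clickhouse")
    keeper = ET.SubElement(root, "keeper_server")
    ET.SubElement(keeper, "tcp_port").text = "9181"
    ET.SubElement(keeper, "server_id").text = str(server_id)
    ET.SubElement(keeper, "log_storage_path").text = "/var/lib/clickhouse/coordination/log"
    ET.SubElement(keeper, "snapshot_storage_path").text = "/var/lib/clickhouse/coordination/snapshots"

    # The raft_configuration block is identical on every node.
    raft = ET.SubElement(keeper, "raft_configuration")
    for node_id, host in enumerate(hosts, start=1):
        server = ET.SubElement(raft, "server")
        ET.SubElement(server, "id").text = str(node_id)
        ET.SubElement(server, "hostname").text = host
        ET.SubElement(server, "port").text = "9234"

    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root, space="    ")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the file for keeper2; only server_id differs from keeper1.
    print(render_keeper_config(2))
```

Generating the files this way keeps the raft_configuration list identical across nodes, which is the part that is easiest to get wrong when editing by hand.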
Configuring Keeper Connectivity
Each ClickHouse server must know how to connect to the Keeper cluster.
Create a configuration file similar to:
<clickhouse>
    <zookeeper>
        <node>
            <host>keeper1.example.com</host>
            <port>9181</port>
        </node>
        <node>
            <host>keeper2.example.com</host>
            <port>9181</port>
        </node>
        <node>
            <host>keeper3.example.com</host>
            <port>9181</port>
        </node>
    </zookeeper>
</clickhouse>
Although you're using ClickHouse Keeper, the configuration section remains named <zookeeper> because Keeper implements the ZooKeeper protocol.
Defining the Cluster
Next, define the cluster on every ClickHouse node.
<clickhouse>
    <remote_servers>
        <analytics_cluster>
            <shard>
                <internal_replication>true</internal_replication>
                <replica>
                    <host>ch-node1.example.com</host>
                    <port>9000</port>
                </replica>
                <replica>
                    <host>ch-node2.example.com</host>
                    <port>9000</port>
                </replica>
            </shard>
        </analytics_cluster>
    </remote_servers>
</clickhouse>
The setting:
<internal_replication>true</internal_replication>
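tells a Distributed table to write each insert to only one replica of the shard and to leave copying the data to the other replicas to ReplicatedMergeTree. With the default value of false, the Distributed table itself writes the same data to every replica.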
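The effect of this setting on writes through a Distributed table can be sketched in a few lines of Python. This is a simplified model, not ClickHouse's implementation: the function name write_targets is hypothetical, and the real server also applies load-balancing settings when choosing a replica, whereas this sketch simply takes the first healthy one.

```python
def write_targets(replicas, internal_replication, healthy=None):
    """Return the replicas that receive a Distributed insert for one shard.

    Simplified model: 'healthy' defaults to all replicas, and the first
    healthy replica is chosen when internal_replication is enabled.
    """
    healthy = set(replicas) if healthy is None else set(healthy)
    if internal_replication:
        candidates = [r for r in replicas if r in healthy]
        if not candidates:
            raise RuntimeError("no healthy replica available for this shard")
        # One write only; ReplicatedMergeTree copies the part to the others.
        return candidates[:1]
    # The Distributed table itself sends the same data to every replica.
    return list(replicas)


shard = ["ch-node1.example.com", "ch-node2.example.com"]
print(write_targets(shard, internal_replication=True))
print(write_targets(shard, internal_replication=False))
```

With replicated tables, the second mode would write the same rows twice through two different paths, which is why true is the appropriate value here.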
Configuring Macros
Macros allow the same table definition to be used across all replicas.
On ch-node1
<clickhouse>
    <macros>
        <shard>01</shard>
        <replica>ch-node1</replica>
    </macros>
</clickhouse>
On ch-node2
<clickhouse>
    <macros>
        <shard>01</shard>
        <replica>ch-node2</replica>
    </macros>
</clickhouse>
During table creation, ClickHouse automatically substitutes these values.
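To make the substitution concrete, here is a small Python sketch of how a {shard} or {replica} placeholder expands differently on each node. The function expand_macros is a hypothetical stand-in for what the server does internally; the path template is the one used for the events table in this guide.

```python
import re


def expand_macros(template, macros):
    """Replace each {name} placeholder with the value from this node's macros."""
    def substitute(match):
        name = match.group(1)
        if name not in macros:
            raise KeyError(f"macro {{{name}}} is not defined on this node")
        return macros[name]

    return re.sub(r"\{(\w+)\}", substitute, template)


# Macro values from the two configuration files above.
NODE_MACROS = {
    "ch-node1": {"shard": "01", "replica": "ch-node1"},
    "ch-node2": {"shard": "01", "replica": "ch-node2"},
}

for node, macros in NODE_MACROS.items():
    path = expand_macros("/clickhouse/tables/{shard}/events", macros)
    replica = expand_macros("{replica}", macros)
    print(f"{node}: path={path} replica={replica}")
```

Both nodes resolve the same Keeper path, because they hold the same shard, but register under different replica names. That is exactly the combination a replicated table needs.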
Verifying Cluster Discovery
Restart ClickHouse on every node and verify that the cluster is recognized.
SELECT
cluster,
shard_num,
replica_num,
host_name
FROM system.clusters
WHERE cluster = 'analytics_cluster';
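Expected output (one row per replica):
| cluster | shard_num | replica_num | host_name |
|---|---|---|---|
| analytics_cluster | 1 | 1 | ch-node1.example.com |
| analytics_cluster | 1 | 2 | ch-node2.example.com |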
If no rows are returned, the cluster configuration has not been loaded successfully.
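When the result looks wrong, it can help to derive the rows you expect directly from the remote_servers file and compare. The sketch below does that with Python's standard XML parser; the function name expected_cluster_rows is an assumption for illustration, and the embedded XML is the cluster definition from this guide.

```python
import xml.etree.ElementTree as ET

REMOTE_SERVERS_XML = """
<clickhouse>
    <remote_servers>
        <analytics_cluster>
            <shard>
                <internal_replication>true</internal_replication>
                <replica><host>ch-node1.example.com</host><port>9000</port></replica>
                <replica><host>ch-node2.example.com</host><port>9000</port></replica>
            </shard>
        </analytics_cluster>
    </remote_servers>
</clickhouse>
"""


def expected_cluster_rows(xml_text, cluster):
    """List (cluster, shard_num, replica_num, host_name) tuples implied by the config."""
    root = ET.fromstring(xml_text)
    cluster_node = root.find(f"remote_servers/{cluster}")
    if cluster_node is None:
        return []  # cluster not defined in this file
    rows = []
    for shard_num, shard in enumerate(cluster_node.findall("shard"), start=1):
        for replica_num, replica in enumerate(shard.findall("replica"), start=1):
            rows.append((cluster, shard_num, replica_num, replica.findtext("host")))
    return rows


for row in expected_cluster_rows(REMOTE_SERVERS_XML, "analytics_cluster"):
    print(row)
```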
Creating a Replicated Table
The recommended approach combines macros with distributed DDL.
CREATE TABLE events
ON CLUSTER analytics_cluster
(
    id UInt64,
    event_time DateTime,
    user_id UInt64
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/events',
    '{replica}'
)
ORDER BY (event_time, id);
This command:
- Creates the table on every node in the cluster.
- Registers the table's metadata in Keeper under the expanded path.
- Registers each replica under the name taken from its {replica} macro.
- Starts replication between the replicas.
How Replication Works
Suppose data is inserted into ch-node1.
INSERT INTO events VALUES (1, now(), 1001);
Replication follows this sequence:
- Data is written locally.
- Metadata is stored in Keeper.
- Other replicas detect the new part.
- Missing parts are downloaded.
- Replicas become synchronized.
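Replication is asynchronous: by default, an INSERT is acknowledged once the data has been written to one replica, and the other replicas fetch it in the background. This keeps insert throughput high, but it means a replica can briefly lag behind.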
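The sequence above can be modelled with a toy in-memory simulation. This is only a sketch under simplifying assumptions: the Keeper and Replica classes are hypothetical, the part name is illustrative, and real replication also handles merges, checksums, and failures. What it does show is the division of labour: the shared log holds part names only, while the rows move directly between replicas.

```python
class Keeper:
    """Stand-in for Keeper: holds the shared log of part names, never row data."""

    def __init__(self):
        self.log = []  # entries are (part_name, source_replica)


class Replica:
    def __init__(self, name, keeper):
        self.name = name
        self.keeper = keeper
        self.parts = {}       # part name -> rows stored on this server
        self.log_pointer = 0  # how far this replica has read the shared log

    def insert(self, rows):
        block = len(self.keeper.log)
        part = f"all_{block}_{block}_0"            # illustrative part name
        self.parts[part] = list(rows)              # 1. data is written locally
        self.keeper.log.append((part, self.name))  # 2. metadata goes to Keeper
        return part

    def sync(self, peers):
        """Read new log entries and fetch any part this replica is missing."""
        for part, source in self.keeper.log[self.log_pointer:]:  # 3. detect new parts
            if part not in self.parts:
                self.parts[part] = list(peers[source].parts[part])  # 4. download
        self.log_pointer = len(self.keeper.log)    # 5. replica is synchronized


keeper = Keeper()
node1 = Replica("ch-node1", keeper)
node2 = Replica("ch-node2", keeper)
peers = {"ch-node1": node1, "ch-node2": node2}

node1.insert([(1, "2024-01-01 00:00:00", 1001)])
print(len(node2.parts))            # nothing yet: replication is asynchronous
node2.sync(peers)
print(node2.parts == node1.parts)  # the replicas now hold the same part
```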
Monitoring Replication
Replica health can be checked using:
SELECT
database,
table,
is_leader,
is_readonly,
queue_size,
absolute_delay
FROM system.replicas;
Important columns include:
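| Column | Description |
|---|---|
| is_leader | Whether the replica is a leader, which is responsible for scheduling background merges; several replicas can be leaders at the same time |
| is_readonly | 1 if the replica is in read-only mode and rejects writes, for example while it has no Keeper session |
| queue_size | Number of operations waiting in the replication queue |
| absolute_delay | How far the replica lags behind, in seconds |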
Healthy replicas typically show:
is_readonly = 0
queue_size = 0
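These checks are easy to automate once the query result is available as a list of rows. The Python sketch below assumes each row is a dictionary keyed by the column names above; the function name replica_problems and both thresholds are illustrative assumptions, not ClickHouse defaults, and the sample rows are made up.

```python
def replica_problems(row, max_queue_size=20, max_delay_seconds=300):
    """Return a list of problems found in one system.replicas row.

    The thresholds are illustrative assumptions; tune them for your workload.
    """
    problems = []
    if row["is_readonly"]:
        problems.append("read-only")
    if row["queue_size"] > max_queue_size:
        problems.append(f"queue_size={row['queue_size']}")
    if row["absolute_delay"] > max_delay_seconds:
        problems.append(f"absolute_delay={row['absolute_delay']}s")
    return problems


# Rows shaped like the query result; the values are invented for the example.
rows = [
    {"database": "default", "table": "events", "is_readonly": 0, "queue_size": 0, "absolute_delay": 0},
    {"database": "default", "table": "events", "is_readonly": 1, "queue_size": 42, "absolute_delay": 900},
]

for row in rows:
    problems = replica_problems(row)
    status = "OK" if not problems else ", ".join(problems)
    print(f"{row['database']}.{row['table']}: {status}")
```

A small allowance for queue_size and absolute_delay avoids alerting on the brief, normal lag that follows every insert.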
Inspecting Replication Tasks
To troubleshoot replication delays:
SELECT
database,
table,
type,
create_time
FROM system.replication_queue;
Common task types include:
- GET_PART
- MERGE_PARTS
- ATTACH_PART
A continuously growing queue often indicates disk, network, or resource bottlenecks.
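"Continuously growing" is easier to judge from several samples than from a single reading. The following sketch shows one way to do it in Python; the helper names and the three-sample minimum are assumptions made for illustration, and the sample data is invented.

```python
from collections import Counter


def tasks_by_type(queue_rows):
    """Count replication_queue rows per task type."""
    return Counter(row["type"] for row in queue_rows)


def is_steadily_growing(queue_sizes, min_samples=3):
    """True if the queue grew between every pair of consecutive samples."""
    if len(queue_sizes) < min_samples:
        return False  # too few samples to call it a trend
    return all(later > earlier for earlier, later in zip(queue_sizes, queue_sizes[1:]))


# Invented sample data: queue entries, and queue_size polled at intervals.
queue_rows = [{"type": "GET_PART"}, {"type": "GET_PART"}, {"type": "MERGE_PARTS"}]
print(tasks_by_type(queue_rows).most_common())
print(is_steadily_growing([5, 9, 14, 30]))
print(is_steadily_growing([5, 9, 7, 3]))
```

A backlog dominated by GET_PART points toward network or disk throughput on the fetching replica, while a backlog of MERGE_PARTS points toward merge pressure.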
Creating a Distributed Table
ReplicatedMergeTree manages replication.
Distributed manages query routing.
Create a distributed table:
CREATE TABLE events_dist
ON CLUSTER analytics_cluster
AS events
ENGINE = Distributed(
'analytics_cluster',
currentDatabase(),
'events',
cityHash64(user_id)
);
Queries against events_dist are now routed to one replica of each shard, and the partial results are combined into a single answer.
Example:
SELECT count()
FROM events_dist;
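The sharding expression cityHash64(user_id) decides which shard receives each inserted row: the value is taken modulo the total shard weight and mapped onto the shards' weight ranges. The Python sketch below illustrates that mapping. Note the assumptions: stand_in_hash64 is NOT cityHash64 (it is a standard-library substitute so the example runs anywhere), and pick_shard is a hypothetical helper name.

```python
import hashlib


def stand_in_hash64(value):
    """NOT cityHash64: a stand-in 64-bit hash built from the standard library."""
    digest = hashlib.blake2b(value.to_bytes(8, "little"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def pick_shard(sharding_value, weights):
    """Map a sharding-expression value to a shard index using weight ranges."""
    remainder = sharding_value % sum(weights)
    upper = 0
    for index, weight in enumerate(weights):
        upper += weight
        if remainder < upper:
            return index
    raise AssertionError("unreachable when all weights are positive")


# One shard of weight 1, as in this guide: every row lands on shard index 0.
print(pick_shard(stand_in_hash64(1001), [1]))

# Two equally weighted shards: rows are split by the remainder of the hash.
for user_id in (1001, 1002, 1003):
    print(user_id, pick_shard(stand_in_hash64(user_id), [1, 1]))
```

Hashing user_id keeps all rows for one user on the same shard, which matters once the cluster grows beyond the single shard used here.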
Distributed DDL
Keeper also coordinates distributed schema changes.
Without distributed DDL:
CREATE TABLE ...
must be executed on every server.
With ON CLUSTER:
CREATE TABLE test
ON CLUSTER analytics_cluster
(
id UInt64
)
ENGINE = MergeTree
ORDER BY id;
the command is stored in Keeper and executed automatically across every node.
The same mechanism applies to:
- CREATE
- ALTER
- DROP
- RENAME
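Conceptually, distributed DDL is a shared task queue: the statement is recorded once, and each host executes it and reports completion. The Python sketch below is a deliberately simplified in-memory model; the DDLQueue class and its methods are hypothetical, and the real queue in Keeper tracks considerably more state.

```python
class DDLQueue:
    """Toy model of a distributed DDL queue shared by all hosts in a cluster."""

    def __init__(self, hosts):
        self.hosts = list(hosts)
        self.entries = []  # each entry is (query, set of hosts that finished it)

    def submit(self, query):
        """Record the statement once; returns the entry's position in the queue."""
        self.entries.append((query, set()))
        return len(self.entries) - 1

    def run_pending(self, host, execute):
        """Execute, in order, every entry this host has not finished yet."""
        for query, finished in self.entries:
            if host not in finished:
                execute(host, query)
                finished.add(host)

    def is_complete(self, entry_id):
        return self.entries[entry_id][1] == set(self.hosts)


queue = DDLQueue(["ch-node1", "ch-node2"])
entry = queue.submit("CREATE TABLE test (id UInt64) ENGINE = MergeTree ORDER BY id")

executed = []
queue.run_pending("ch-node1", lambda host, query: executed.append(host))
print(queue.is_complete(entry))  # ch-node2 has not run the statement yet
queue.run_pending("ch-node2", lambda host, query: executed.append(host))
print(queue.is_complete(entry))  # every host has now executed it
```

Because the entry stays in the queue until every host has processed it, a node that was briefly offline can still catch up on the statement when it returns.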
Monitoring Keeper Connectivity
Verify Keeper connectivity with:
SELECT *
FROM system.zookeeper
LIMIT 10
SETTINGS allow_unrestricted_reads_from_keeper = 1;
Successful results confirm communication between ClickHouse and Keeper.
If Keeper becomes unavailable, replicas may switch to read-only mode.
Check replica status with:
SELECT
database,
table,
is_readonly
FROM system.replicas;
Common Failure Scenarios
Keeper Quorum Loss
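If a majority of Keeper nodes become unavailable, the remaining nodes cannot form a quorum and no leader can be elected.
Example:
3 Keeper Nodes
2 Nodes Offline
With only one of three nodes left, quorum (two nodes) is lost. Replicated tables switch to read-only mode, so inserts fail and replication pauses until quorum is restored. Data already on the ClickHouse servers can still be queried.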
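The quorum rule is simple majority arithmetic, which a few lines of Python make explicit. The function names are chosen for this example only.

```python
def quorum_size(total_nodes):
    """Smallest number of nodes that forms a majority."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes, online_nodes):
    return online_nodes >= quorum_size(total_nodes)


def tolerated_failures(total_nodes):
    """How many nodes can fail while a majority remains."""
    return total_nodes - quorum_size(total_nodes)


for total in (1, 3, 4, 5):
    print(f"{total} nodes: quorum={quorum_size(total)}, tolerates {tolerated_failures(total)} failure(s)")

print(has_quorum(3, 1))  # the example above: 2 of 3 nodes offline
```

The arithmetic also shows why odd cluster sizes are preferred: four nodes tolerate one failure, the same as three, so the fourth node adds cost without adding resilience.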
Replica Lag
Typical symptoms:
queue_size increasing
absolute_delay increasing
Possible causes:
- Slow storage
- Network bottlenecks
- Heavy background merges
Misconfigured Replica Names
Each replica must have a unique identifier.
Using duplicate replica names can prevent replication from working correctly.
Per-node macros greatly reduce this risk, provided that each server's {replica} value is itself unique within the shard.
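A quick pre-deployment check can catch duplicated names before any table is created. The sketch below assumes you have collected each server's macros into a dictionary; the function name duplicate_replicas and the deliberately broken sample data are illustrative.

```python
from collections import Counter


def duplicate_replicas(node_macros):
    """Return (shard, replica) pairs that are claimed by more than one server."""
    counts = Counter((m["shard"], m["replica"]) for m in node_macros.values())
    return sorted(pair for pair, count in counts.items() if count > 1)


# Deliberately broken example: ch-node2 was configured with ch-node1's replica name.
node_macros = {
    "ch-node1.example.com": {"shard": "01", "replica": "ch-node1"},
    "ch-node2.example.com": {"shard": "01", "replica": "ch-node1"},
}

for shard, replica in duplicate_replicas(node_macros):
    print(f"duplicate replica name {replica!r} in shard {shard}")
```

The same replica name is acceptable in different shards, which is why the check compares the shard and replica values as a pair.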
Production Best Practices
For production environments:
- Deploy at least three Keeper nodes.
- Use macros for all replicated tables.
- Use ON CLUSTER for schema management.
- Continuously monitor system.replicas.
- Monitor system.replication_queue.
- Keep Keeper storage separate from busy data disks.
- Regularly test node failure and recovery procedures.
- Document shard and replica topology.
Conclusion
ClickHouse Keeper simplifies cluster management by providing built-in coordination without requiring an external ZooKeeper deployment. Combined with replicated tables, distributed DDL, and distributed tables, it enables highly available and horizontally scalable ClickHouse clusters.
Understanding how Keeper manages metadata, coordinates replicas, and automates cluster-wide operations is a key step toward building reliable production-grade analytical systems.







