Introduction
When designing tables in ClickHouse®, one of the most important decisions you'll make is choosing the right primary key. Unlike traditional relational databases, where primary keys enforce uniqueness, ClickHouse® uses the primary key to organize data on disk and optimize query performance.
A well-designed primary key allows ClickHouse® to skip unnecessary data during query execution, significantly reducing the amount of data scanned. On the other hand, a poorly chosen primary key can lead to slower queries and inefficient resource usage.
In this article, we'll explore how primary keys work in ClickHouse®, common mistakes to avoid, and best practices for selecting the right primary key for your analytical workloads.
Understanding Primary Keys in ClickHouse®
Developers coming from databases like PostgreSQL or MySQL often expect primary keys to prevent duplicate values.
For example:
CREATE TABLE users
(
id UInt64,
name String
)
ENGINE = MergeTree
ORDER BY id;
Because no separate PRIMARY KEY clause is given, the ORDER BY expression also serves as the primary key. Even so, ClickHouse® does not enforce uniqueness on id: two rows with the same id are both stored.
Instead, the primary key determines how data is physically sorted and indexed, enabling ClickHouse® to efficiently skip irrelevant data during query execution.
The goal of a primary key is performance, not uniqueness.
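A quick way to see this for yourself is sketched below against the users table defined above. The inserted values are made up for illustration, and the result assumes the table holds no other rows with that id.

```sql
-- Insert two rows that share the same id; MergeTree accepts both.
INSERT INTO users (id, name) VALUES (1, 'alice'), (1, 'alice again');

-- List ids that occur more than once.
SELECT id, count() AS row_count
FROM users
GROUP BY id
HAVING row_count > 1;
```

If the second query returns a row for id 1, the table holds duplicates, which is the expected behavior for a plain MergeTree table. Any uniqueness requirement has to be handled outside the primary key.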
How ClickHouse® Uses the Primary Key
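ClickHouse® stores a sparse index based on the primary key: rather than indexing every row, it keeps one index entry per granule, a block of rows (8192 by default).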
When queries filter on primary key columns, ClickHouse® quickly identifies which data ranges must be read while skipping the rest.
For example:
SELECT *
FROM events
WHERE user_id = 12345;
If the table is defined as:
ORDER BY user_id
ClickHouse® reads only the relevant granules instead of scanning the entire table.
This granule-level data skipping is one of the key reasons ClickHouse® can answer analytical queries quickly on large tables.
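To make the example concrete, here is one possible definition of the events table. Only user_id comes from the query above; the event_time and event_type columns are assumptions for illustration, and index_granularity is written out explicitly at its default value just to show where the granule size comes from.

```sql
-- Hypothetical schema: only user_id appears in the article's query.
CREATE TABLE events
(
    user_id    UInt64,
    event_time DateTime,
    event_type String
)
ENGINE = MergeTree
ORDER BY user_id
SETTINGS index_granularity = 8192;

-- After loading data, show rows and index marks for each active part.
SELECT name, rows, marks
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'events'
  AND active;
```

Comparing rows with marks in the second query gives a feel for how sparse the index is: roughly one mark for every 8192 rows in each part.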
Design Around Query Patterns
One of the most common mistakes is selecting a primary key based only on the table schema.
Instead, ask questions like:
- How will users query the data?
- Which columns appear most often in WHERE clauses?
- What filters are used in dashboards?
- Which dimensions are commonly analyzed?
Consider an observability workload:
SELECT *
FROM logs
WHERE service_name = 'payments'
AND timestamp >= now() - INTERVAL 1 DAY;
A better primary key would be:
ORDER BY (service_name, timestamp)
rather than:
ORDER BY log_level
because it aligns with real query patterns.
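A full table definition for this workload might look like the sketch below. The column types and the message column are assumptions, since the article only names service_name, timestamp, and log_level.

```sql
-- Hypothetical logs table sorted to match the query above.
CREATE TABLE logs
(
    timestamp    DateTime,
    service_name LowCardinality(String),  -- assumed: a small set of service names
    log_level    LowCardinality(String),
    message      String
)
ENGINE = MergeTree
ORDER BY (service_name, timestamp);
```

With this layout, rows for one service are stored together and ordered by time, so the filter on service_name plus a time range maps to a contiguous slice of each part.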
Prioritize Frequently Filtered Columns
The leading columns of the primary key should usually be those that appear most frequently in filtering conditions.
Example:
ORDER BY (customer_id, event_time)
This design performs well when queries commonly filter by customer.
Example query:
SELECT *
FROM events
WHERE customer_id = 1001;
Since rows are organized by customer_id, ClickHouse® can efficiently locate only the required data.
Choosing rarely filtered columns as the leading key often provides little performance benefit.
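One way to check this on your own data is to compare a filter on the leading column with a filter on a later column. The sketch below assumes an events table sorted by (customer_id, event_time), with event_time stored as a DateTime.

```sql
-- Filter on the leading key column.
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE customer_id = 1001;

-- Filter only on the second key column.
EXPLAIN indexes = 1
SELECT count()
FROM events
WHERE event_time >= now() - INTERVAL 1 HOUR;
```

In the output, compare the selected granules against the total for each query. The second query usually selects a larger share, although the exact difference depends on the data distribution.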
Primary Keys for Time-Series Data
Many ClickHouse® workloads involve time-series data such as:
- Application logs
- Metrics
- User activity
- IoT telemetry
- Monitoring events
A common approach is:
ORDER BY timestamp
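While valid, this is not always the most efficient choice: a query for a single service or host still has to read every row in the requested time range.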
In many production environments, a composite key works better:
ORDER BY (service_name, timestamp)
or
ORDER BY (host, timestamp)
This allows ClickHouse® to narrow results using both the entity and the time range.
The ideal design depends on your workload and query patterns.
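As an illustration, a metrics table keyed by host and time could look like the following. The table name, columns, and the literal values in the query are hypothetical.

```sql
-- Hypothetical metrics table with an entity column before the timestamp.
CREATE TABLE metrics
(
    host      LowCardinality(String),
    metric    LowCardinality(String),
    timestamp DateTime,
    value     Float64
)
ENGINE = MergeTree
ORDER BY (host, timestamp);

-- Per-minute average of one metric for one host over the last hour.
SELECT
    toStartOfMinute(timestamp) AS minute,
    avg(value) AS avg_value
FROM metrics
WHERE host = 'web-01'
  AND metric = 'cpu_usage'
  AND timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute;
```

The host and timestamp filters both line up with the sort order, while the metric filter is applied to the rows that remain.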
Avoid Random Leading Columns
Highly random columns rarely make good leading primary key components.
Examples include:
ORDER BY uuid
or
ORDER BY transaction_id
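Although these are valid, values with no natural order scatter related rows across the whole table, so filters on any other column can rarely exclude granules. Such a key mainly helps point lookups on the identifier itself.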
Choose columns that help eliminate irrelevant data during query execution.
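If the identifier still needs to be part of the key, one option is to move it to the end. The sketch below uses a hypothetical transactions table; merchant_id, created_at, and amount are invented for the example.

```sql
-- Hypothetical table: filterable columns lead, the random identifier trails.
CREATE TABLE transactions
(
    transaction_id UUID,
    merchant_id    UInt32,
    created_at     DateTime,
    amount         Decimal(18, 2)
)
ENGINE = MergeTree
ORDER BY (merchant_id, created_at, transaction_id);
```

Queries that filter by merchant and time range can then skip most of the table, which would not be the case with ORDER BY transaction_id.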
Consider Cardinality
Cardinality refers to the number of unique values in a column.
Examples of low-cardinality columns:
- country
- status
- environment
Examples of high-cardinality columns:
- user_id
- uuid
A common strategy is to place useful filtering dimensions before highly granular identifiers.
For example:
ORDER BY (country, user_id)
may perform better than:
ORDER BY (user_id, country)
depending on how the data is queried.
Always validate your assumptions using real workloads.
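Before deciding on an order, it can help to measure cardinality rather than guess. The query below is a small sketch that assumes an events table with country and user_id columns; uniq returns an approximate distinct count, which is normally precise enough for this purpose.

```sql
-- Approximate number of distinct values per candidate key column.
SELECT
    uniq(country) AS country_cardinality,
    uniq(user_id) AS user_id_cardinality
FROM events;
```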
Primary Keys vs Partitions
Primary keys and partitions serve different purposes.
Example:
PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, timestamp)
Partitions help with:
- Data retention
- Lifecycle management
- Partition pruning
Primary keys help with:
- Data skipping
- Faster queries
- Efficient reads
In most analytical workloads, the primary key has a greater impact on query performance than partitioning.
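The lifecycle side of partitioning can be sketched as follows, assuming an events table partitioned by toYYYYMM(timestamp) as in the example above. The partition value 202401 is only an illustration, and dropping a partition permanently deletes its data.

```sql
-- List active partitions with their row counts and size on disk.
SELECT
    partition,
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'events'
  AND active
GROUP BY partition
ORDER BY partition;

-- Remove one month of data in a single operation (destructive).
ALTER TABLE events DROP PARTITION 202401;
```

Nothing in this workflow depends on the ORDER BY columns, which is why the two choices can be made separately.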
Validate with EXPLAIN
One of the best ways to evaluate a primary key is by examining the execution plan.
Example:
EXPLAIN indexes = 1
SELECT *
FROM events
WHERE user_id = 1001;
Review important metrics such as:
- Granules scanned
- Parts accessed
- Filtering effectiveness
Testing different primary key designs helps identify the most efficient configuration for your workload.
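Alongside the plan, the query log shows how much data a query actually read. The sketch below assumes query logging is enabled, which it normally is by default, and that the test queries ran within the last ten minutes. Log entries are flushed periodically, so recent queries can take a few seconds to appear.

```sql
-- Rows read, bytes read, and duration for recent queries on events.
-- Note: this query can also match its own earlier runs.
SELECT
    query,
    read_rows,
    formatReadableSize(read_bytes) AS read_size,
    query_duration_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time >= now() - INTERVAL 10 MINUTE
  AND query ILIKE '%FROM events%'
ORDER BY event_time DESC
LIMIT 10;
```

Running the same test query against tables that differ only in their ORDER BY clause and comparing read_rows gives a direct measure of how much each design lets ClickHouse® skip.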
Common Primary Key Mistakes
Avoid these common mistakes:
- Choosing keys based on uniqueness instead of query patterns.
- Ignoring how users actually query the data.
- Optimizing only for rare queries.
- Using random identifiers as leading columns.
- Copying designs from unrelated workloads without testing.
Every dataset has different access patterns, so primary keys should always be workload-specific.
Best Practices
When choosing a primary key in ClickHouse®:
- Design around query patterns.
- Start with frequently filtered columns.
- Consider how users access the data.
- Avoid random leading columns.
- Balance cardinality carefully.
- Use EXPLAIN to validate performance.
- Test with realistic production workloads.
- Remember that primary keys exist for performance, not uniqueness.
Conclusion
Choosing the right primary key is one of the most important table design decisions in ClickHouse®.
A well-designed primary key allows ClickHouse® to skip large portions of data, reduce query latency, and improve overall efficiency. Conversely, a poor primary key can force unnecessary scans and limit performance regardless of hardware resources.
The key takeaway is simple: design primary keys based on how your data is queried, not how it is stored. Following this principle will help you build ClickHouse® tables that scale efficiently as your datasets continue to grow.







