Introduction

When designing tables in ClickHouse®, one of the most important decisions you'll make is choosing the right primary key. Unlike traditional relational databases, where primary keys enforce uniqueness, ClickHouse® uses the primary key to organize data on disk and optimize query performance.

A well-designed primary key allows ClickHouse® to skip unnecessary data during query execution, significantly reducing the amount of data scanned. On the other hand, a poorly chosen primary key can lead to slower queries and inefficient resource usage.

In this article, we'll explore how primary keys work in ClickHouse®, common mistakes to avoid, and best practices for selecting the right primary key for your analytical workloads.

Understanding Primary Keys in ClickHouse®

Developers coming from databases like PostgreSQL or MySQL often expect primary keys to prevent duplicate values.

Consider a simple table:

CREATE TABLE users
(
    id UInt64,
    name String
)
ENGINE = MergeTree
ORDER BY id;

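Here the ORDER BY clause defines the sorting key, and because no separate PRIMARY KEY clause is given, ClickHouse® uses the same expression as the primary key. Although id acts as the primary key, ClickHouse® does not enforce uniqueness: inserting two rows with the same id stores both.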

Instead, the primary key determines how data is physically sorted and indexed, enabling ClickHouse® to efficiently skip irrelevant data during query execution.

The goal of a primary key is performance—not uniqueness.
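To see this behavior directly, the following sketch inserts two rows with the same id into the users table defined above and then looks for ids that were stored more than once. The row values are made up for illustration.

```sql
-- Two inserts with the same id; a plain MergeTree table accepts both.
INSERT INTO users VALUES (1, 'alice');
INSERT INTO users VALUES (1, 'alice (same id again)');

-- List ids that appear more than once.
SELECT id, count() AS copies
FROM users
GROUP BY id
HAVING copies > 1;
```

On a plain MergeTree table this should report id 1 with two copies, which is the opposite of what a PostgreSQL or MySQL primary key would allow.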

How ClickHouse® Uses the Primary Key

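ClickHouse® stores a sparse index based on the primary key: rather than indexing every row, it keeps one index entry per granule, a block of consecutive rows whose size is controlled by the index_granularity setting (8192 rows by default).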

When queries filter on primary key columns, ClickHouse® quickly identifies which data ranges must be read while skipping the rest.
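A quick back-of-the-envelope calculation shows why a sparse index stays small. The sketch below assumes the default granule size of 8192 rows and a hypothetical table of 100 million rows. Both numbers are illustrative, not taken from a real deployment.

```sql
-- Rough number of index entries for a hypothetical 100-million-row table,
-- assuming one entry per granule of 8192 rows.
SELECT
    100000000 AS total_rows,
    8192 AS rows_per_granule,
    ceil(total_rows / rows_per_granule) AS index_entries;
```

The result is on the order of twelve thousand entries for 100 million rows, which is why the index can be searched quickly to find the granules worth reading.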

For example:

SELECT *
FROM events
WHERE user_id = 12345;

If the table is defined as:

ORDER BY user_id

ClickHouse® reads only the relevant granules instead of scanning the entire table.

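This granule-level pruning is one of the key reasons ClickHouse® delivers exceptional analytical query performance. It is often described informally as data skipping, but it is separate from the optional secondary indexes that ClickHouse® calls data skipping indexes.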
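For a more concrete picture, here is a minimal sketch of a full events table sorted by user_id, followed by a query that shows how many index marks each data part holds. The event_time and event_type columns are assumptions added for illustration. Only user_id comes from the example above.

```sql
-- Hypothetical events table; only user_id appears in the original example.
CREATE TABLE events
(
    user_id    UInt64,
    event_time DateTime,
    event_type LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY user_id
SETTINGS index_granularity = 8192;  -- the default, written out for clarity

-- Inspect the active parts: rows stored and index marks per part.
SELECT name, rows, marks
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'events'
  AND active;
```

Comparing rows with marks for each part gives a feel for how coarse the index is relative to the data it covers.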

Design Around Query Patterns

One of the most common mistakes is selecting a primary key based only on the table schema.

Instead, ask questions like:

How will users query the data?
Which columns appear most often in WHERE clauses?
What filters are used in dashboards?
Which dimensions are commonly analyzed?

Consider an observability workload:

SELECT *
FROM logs
WHERE service_name = 'payments'
  AND timestamp >= now() - INTERVAL 1 DAY;

A better primary key would be:

ORDER BY (service_name, timestamp)

rather than:

ORDER BY log_level

because it aligns with real query patterns.
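Put together, a table definition for this workload might look like the sketch below. The column list and types are assumptions. The source only names service_name, timestamp, and log_level.

```sql
-- Hypothetical logs table whose sort order matches the dashboard query:
-- equality filter on service_name first, then a range on timestamp.
CREATE TABLE logs
(
    timestamp    DateTime,
    service_name LowCardinality(String),
    log_level    LowCardinality(String),
    message      String
)
ENGINE = MergeTree
ORDER BY (service_name, timestamp);
```

With this layout, all rows for one service are stored together and sorted by time, so the query above only needs the tail end of the 'payments' range.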

Prioritize Frequently Filtered Columns

The leading columns of the primary key should usually be those that appear most frequently in filtering conditions.

Example:

ORDER BY (customer_id, event_time)

This design performs well when queries commonly filter by customer.

Example query:

SELECT *
FROM events
WHERE customer_id = 1001;

Since rows are organized by customer_id, ClickHouse® can efficiently locate only the required data.

Choosing rarely filtered columns as the leading key often provides little performance benefit.
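The difference shows up when two queries against the same table are compared. This sketch assumes an events table defined with ORDER BY (customer_id, event_time), as in the example above.

```sql
-- Filters on the leading key column: ClickHouse can narrow the read to the
-- granules whose customer_id range contains 1001.
SELECT count()
FROM events
WHERE customer_id = 1001;

-- Filters only on the second key column: the index is typically far less
-- selective here, so many more granules may have to be read.
SELECT count()
FROM events
WHERE event_time >= now() - INTERVAL 1 DAY;
```

If the second query turned out to be the common one, that would be a signal to reconsider the column order.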

Primary Keys for Time-Series Data

Many ClickHouse® workloads involve time-series data such as:

Application logs
Metrics
User activity
IoT telemetry
Monitoring events

A common approach is:

ORDER BY timestamp

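While valid, this is not always the most efficient choice: a query for a single service or host still has to read every row in the requested time range, including rows for all other entities.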

In many production environments, a composite key works better:

ORDER BY (service_name, timestamp)

ORDER BY (host, timestamp)

This allows ClickHouse® to narrow results using both the entity and the time range.

The ideal design depends on your workload and query patterns.
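As an illustration, here is a sketch of a metrics table keyed by host and time, with a typical per-host query. The table name, the metric_name and value columns, and the host 'web-01' are all hypothetical.

```sql
-- Hypothetical metrics table: entity first, then time.
CREATE TABLE metrics
(
    host        LowCardinality(String),
    timestamp   DateTime,
    metric_name LowCardinality(String),
    value       Float64
)
ENGINE = MergeTree
ORDER BY (host, timestamp);

-- Per-minute average for one host over the last hour. The host equality
-- and the time range both line up with the sort order.
SELECT
    toStartOfMinute(timestamp) AS minute,
    avg(value) AS avg_value
FROM metrics
WHERE host = 'web-01'
  AND metric_name = 'cpu_usage'
  AND timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute;
```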

Avoid Random Leading Columns

Highly random columns rarely make good leading primary key components.

Examples include:

ORDER BY uuid

ORDER BY transaction_id

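Although these are valid, and a point lookup on the identifier itself remains fast, random values scatter related rows across the whole table. Filters on any other column, such as a time range, can then skip very little data, and compression tends to suffer because neighboring rows have little in common.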

Choose columns that help eliminate irrelevant data during query execution.
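One possible alternative is to keep the identifier in the key but move it to the end, behind columns that queries actually filter on. The sketch below is one such layout. The table, merchant_id, created_at, and amount are hypothetical names, and whether this order fits depends on the workload.

```sql
-- Hypothetical transactions table: filterable dimensions lead,
-- the random identifier comes last.
CREATE TABLE transactions
(
    transaction_id UUID,
    merchant_id    UInt64,
    created_at     DateTime,
    amount         Decimal(18, 2)
)
ENGINE = MergeTree
ORDER BY (merchant_id, created_at, transaction_id);
```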

Consider Cardinality

Cardinality refers to the number of unique values in a column.

Examples of low-cardinality columns:

country
status
environment

Examples of high-cardinality columns:

user_id
email
uuid

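A common strategy is to place useful filtering dimensions before highly granular identifiers. Putting lower-cardinality columns first tends to keep the later key columns useful for filtering and usually improves compression, because similar values end up stored next to each other.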

For example:

ORDER BY (country, user_id)

may perform better than:

ORDER BY (user_id, country)

depending on how the data is queried.

Always validate your assumptions using real workloads.
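Cardinality is easy to measure rather than guess. The query below is a small sketch that assumes an existing events table with country and user_id columns. uniq returns an approximate distinct count, which is usually enough for this purpose.

```sql
-- Approximate number of distinct values per candidate key column.
SELECT
    uniq(country) AS country_cardinality,
    uniq(user_id) AS user_id_cardinality
FROM events;
```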

Primary Keys vs Partitions

Primary keys and partitions serve different purposes.

Example:

PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, timestamp)

Partitions help with:

Data retention
Lifecycle management
Partition pruning

Primary keys help with:

Data skipping
Faster queries
Efficient reads

In most analytical workloads, the primary key has a greater impact on query performance than partitioning.
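The division of labor can be sketched with a single table. The column list and the dates below are assumptions chosen for illustration.

```sql
-- Hypothetical events table: monthly partitions for lifecycle management,
-- sort order for query performance.
CREATE TABLE events
(
    user_id    UInt64,
    timestamp  DateTime,
    event_type LowCardinality(String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (user_id, timestamp);

-- Retention: remove January 2024 as a whole partition
-- instead of deleting individual rows.
ALTER TABLE events DROP PARTITION 202401;

-- Query: the time range lets ClickHouse ignore other months' partitions,
-- and the primary key narrows the read to user 1001 within that month.
SELECT count()
FROM events
WHERE user_id = 1001
  AND timestamp >= '2024-03-01 00:00:00'
  AND timestamp <  '2024-04-01 00:00:00';
```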

Validate with EXPLAIN

One of the best ways to evaluate a primary key is by examining the execution plan.

Example:

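EXPLAIN indexes = 1
SELECT *
FROM events
WHERE user_id = 1001;

The indexes = 1 setting adds index usage to the plan. For the primary key, review:

Granules selected versus the total
Parts selected versus the total
The key columns and condition the index actually used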

Testing different primary key designs helps identify the most efficient configuration for your workload.
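One way to run such a comparison is to load the same data into two candidate tables and look at the plan for the same query on each. The sketch below uses synthetic data from the numbers table function. The table names, columns, and data distribution are all made up, and a realistic test would use a sample of production data instead.

```sql
-- Candidate A: user first.
CREATE TABLE events_by_user
(
    user_id    UInt64,
    event_time DateTime,
    event_type String
)
ENGINE = MergeTree
ORDER BY (user_id, event_time);

-- Candidate B: time first.
CREATE TABLE events_by_time
(
    user_id    UInt64,
    event_time DateTime,
    event_type String
)
ENGINE = MergeTree
ORDER BY (event_time, user_id);

-- Load identical synthetic rows into both tables:
-- 10,000 distinct users, one event per second.
INSERT INTO events_by_user
SELECT
    number % 10000 AS user_id,
    addSeconds(toDateTime('2024-01-01 00:00:00'), number) AS event_time,
    'click' AS event_type
FROM numbers(1000000);

INSERT INTO events_by_time
SELECT user_id, event_time, event_type
FROM events_by_user;

-- Compare how many granules each design selects for the same filter.
EXPLAIN indexes = 1
SELECT count() FROM events_by_user WHERE user_id = 1001;

EXPLAIN indexes = 1
SELECT count() FROM events_by_time WHERE user_id = 1001;
```

The design that selects fewer granules for the queries that matter most is usually the better candidate.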

Common Primary Key Mistakes

Avoid these common mistakes:

Choosing keys based on uniqueness instead of query patterns.
Ignoring how users actually query the data.
Optimizing only for rare queries.
Using random identifiers as leading columns.
Copying designs from unrelated workloads without testing.

Every dataset has different access patterns, so primary keys should always be workload-specific.

Best Practices

When choosing a primary key in ClickHouse®:

Design around query patterns.
Start with frequently filtered columns.
Consider how users access the data.
Avoid random leading columns.
Balance cardinality carefully.
Use EXPLAIN to validate performance.
Test with realistic production workloads.
Remember that primary keys improve performance—not uniqueness.

Conclusion

Choosing the right primary key is one of the most important table design decisions in ClickHouse®.

A well-designed primary key allows ClickHouse® to skip large portions of data, reduce query latency, and improve overall efficiency. Conversely, a poor primary key can force unnecessary scans and limit performance regardless of hardware resources.

The key takeaway is simple: design primary keys around how your data is queried, not around the shape of the schema. Following this principle will help you build ClickHouse® tables that scale efficiently as your datasets continue to grow.