Advanced ClickHouse® Aggregating Functions

Introduction

Aggregation is one of the key capabilities that makes ClickHouse® an exceptional analytical database. While functions like SUM(), COUNT(), and AVG() are commonly used, modern analytics often require far more advanced aggregation techniques.

Whether you're building observability platforms, financial dashboards, recommendation engines, or large-scale event analytics systems, advanced aggregate functions help you process billions of records efficiently while minimizing memory usage and computation time.

In this article, we'll explore ClickHouse®'s advanced aggregation functions, understand how they work, and learn when to use each one.

Understanding Aggregate Function States

One of the most powerful concepts in ClickHouse® is aggregate states.

Unlike traditional databases that perform aggregation in a single step, ClickHouse® separates aggregation into two phases:

Building the aggregation state
Finalizing the aggregation result

This behavior is exposed through aggregate combinators such as:

sumState()
sumMerge()
avgState()
avgMerge()
uniqState()
uniqMerge()

Example:

SELECT
    uniqState(user_id) AS state
FROM events;

Instead of returning the final distinct count, this query produces an intermediate aggregation state.

The state can later be merged:

SELECT
    uniqMerge(state)
FROM aggregated_events;

This design enables several powerful features, including:

Materialized Views
AggregatingMergeTree
Incremental aggregations
Distributed query processing

Cardinality Estimation Functions

Counting distinct values becomes increasingly expensive as datasets grow.

A query like:

SELECT count(DISTINCT user_id)
FROM events;

requires maintaining a large in-memory hash set, making it inefficient for massive datasets.

ClickHouse® offers several optimized functions for cardinality estimation.

uniq()

SELECT uniq(user_id)
FROM events;

Features

Approximate result
Extremely fast
Low memory usage
Recommended for most analytical workloads

uniq() offers an excellent balance between speed, accuracy, and resource consumption.

uniqExact()

SELECT uniqExact(user_id)
FROM events;

Features

Exact result
Higher memory usage
Slower execution

Use it when precision is critical, such as:

Financial reporting
Compliance
Auditing

Avoid it for extremely large high-cardinality datasets unless exact counts are mandatory.

uniqCombined()

SELECT uniqCombined(user_id)
FROM events;

Features

Hybrid algorithm
High accuracy
Better scalability
Lower memory usage than exact counting

Ideal for very large datasets where accuracy remains important.

uniqHLL12()

SELECT uniqHLL12(user_id)
FROM events;

This function uses the HyperLogLog algorithm.

Features

Fixed memory usage
Approximate results
Scales efficiently

A good choice when predictable memory consumption is more important than maximum accuracy.

uniqTheta()

SELECT uniqTheta(user_id)
FROM events;

Based on Theta Sketches, this function provides:

Efficient sketch merging
Excellent distributed performance
High scalability

It is commonly used in distributed analytics environments.

Quantile Functions

Average values rarely tell the complete story.

Consider this query:

SELECT avg(response_time_ms)
FROM requests;

An average response time of 100 ms may still hide occasional requests taking several seconds.

Percentile calculations help reveal these latency spikes.

quantile()

SELECT quantile(0.95)(response_time_ms)
FROM requests;

Returns the 95th percentile, meaning that 95% of requests complete within this value.

quantiles()

Multiple percentiles can be calculated simultaneously.

SELECT
    quantiles(0.5, 0.9, 0.95, 0.99)(response_time_ms)
FROM requests;

This returns:

Median (P50)
P90
P95
P99

Computing them together is more efficient than separate queries.

quantileExact()

SELECT
    quantileExact(0.99)(response_time_ms)
FROM requests;

Features

Exact percentile
Higher memory usage
More computationally expensive

Suitable for workloads requiring precise percentile calculations.

quantileTDigest()

SELECT
    quantileTDigest(0.99)(response_time_ms)
FROM requests;

Uses the T-Digest algorithm.

Benefits

Excellent tail latency estimation
Efficient memory usage
High accuracy

Widely used in:

API monitoring
Infrastructure observability
Database performance monitoring

quantileTiming()

SELECT
    quantileTiming(0.99)(response_time_ms)
FROM requests;

Optimized specifically for timing distributions.

Ideal for measuring:

API response times
Service latency
Request durations

Top-K Analysis

Many analytical workloads require identifying the most frequent values.

Examples include:

Most viewed products
Most searched keywords
Most active users
Most common error codes

topK()

SELECT
    topK(10)(product_id)
FROM orders;

Returns the ten most frequent products.

Features

Approximate
Very efficient
Suitable for streaming-scale workloads

Instead of sorting every unique value, ClickHouse® maintains a compact data structure that tracks the most frequent elements.

topKWeighted()

SELECT
    topKWeighted(10)(product_id, quantity)
FROM sales;

This version considers weights during aggregation.

Useful for ranking based on:

Revenue
Purchase quantity
Weighted popularity

Statistical Aggregation Functions

ClickHouse® also includes built-in statistical functions.

varPop() and varSamp()

Population variance:

SELECT varPop(price)
FROM products;

Sample variance:

SELECT varSamp(price)
FROM products;

stddevPop() and stddevSamp()

Population standard deviation:

SELECT stddevPop(price)
FROM products;

Sample standard deviation:

SELECT stddevSamp(price)
FROM products;

Useful for:

Risk analysis
Anomaly detection
Data quality monitoring

covarPop()

SELECT
    covarPop(x, y)
FROM measurements;

Calculates covariance between two variables.

corr()

SELECT
    corr(x, y)
FROM measurements;

Returns the Pearson correlation coefficient ranging from -1 to +1.

Commonly used for:

Exploratory analytics
Feature engineering
Statistical analysis

Bitmap-Based Aggregations

For extremely large user sets, bitmap aggregation provides an efficient alternative to traditional distinct counting.

groupBitmap()

SELECT
    groupBitmap(user_id)
FROM events;

Creates a bitmap representation of users.

bitmapCardinality()

SELECT
    bitmapCardinality(bitmap_column)
FROM users;

Returns the number of distinct values stored in the bitmap.

Benefits

Compact storage
Fast set operations
Efficient merging

Frequently used in:

User segmentation
Audience analytics
Advertising platforms

Array-Based Aggregations

Sometimes analytical queries require collecting values rather than producing a single aggregate.

groupArray()

SELECT
    user_id,
    groupArray(page)
FROM visits
GROUP BY user_id;

Example output:

["home","pricing","checkout"]

Useful for user journey and clickstream analysis.

groupUniqArray()

SELECT
    groupUniqArray(page)
FROM visits;

Collects only unique values while aggregating.

groupArraySorted()

SELECT
    groupArraySorted(10)(score)
FROM results;

Returns sorted aggregated values.

AggregatingMergeTree and Aggregate States

One of the most powerful production patterns combines:

Aggregate states
Materialized Views
AggregatingMergeTree

Example table:

CREATE TABLE daily_users
(
    date Date,
    users AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree()
ORDER BY date;

Materialized View:

CREATE MATERIALIZED VIEW mv_daily_users
TO daily_users
AS
SELECT
    toDate(event_time) AS date,
    uniqState(user_id) AS users
FROM events
GROUP BY date;

Querying the aggregated data:

SELECT
    date,
    uniqMerge(users)
FROM daily_users
GROUP BY date;

Benefits

Faster analytical queries
Lower CPU usage
Incremental aggregation
Better scalability

This architecture is widely adopted in production ClickHouse® deployments.

Choosing the Right Aggregation Function

Requirement	Recommended Function
Fast distinct counting	`uniq()`
Exact distinct counting	`uniqExact()`
Massive cardinality	`uniqCombined()`
Fixed-memory cardinality	`uniqHLL12()`
Approximate percentile	`quantile()`
High-quality percentile estimation	`quantileTDigest()`
Exact percentile	`quantileExact()`
Most frequent values	`topK()`
User segmentation	`groupBitmap()`
Incremental aggregation	`State()` + `Merge()` functions

Conclusion

Advanced aggregation functions are one of the primary reasons ClickHouse® delivers exceptional analytical performance at scale.

The key concepts to master include:

Aggregate states using State() and Merge()
Approximate distinct counting with the uniq family
Percentile analysis using quantile() and quantileTDigest()
Heavy-hitter detection with topK()
Bitmap-based aggregations for large user sets
Incremental aggregation using AggregatingMergeTree

Choosing the appropriate aggregation function for your workload often has a greater impact on performance, scalability, and resource usage than simply adding more hardware. By understanding these advanced capabilities, you can build faster, more efficient analytical systems with ClickHouse®.

Day 34: Advanced ClickHouse® Aggregating Functions

Advanced ClickHouse® Aggregating Functions

Introduction

Understanding Aggregate Function States

Cardinality Estimation Functions

uniq()

Features

uniqExact()

Features

uniqCombined()

Features

uniqHLL12()

Features

uniqTheta()

Quantile Functions

quantile()

quantiles()

quantileExact()

Features

quantileTDigest()

Benefits

quantileTiming()

Top-K Analysis

topK()

Features

topKWeighted()

Statistical Aggregation Functions

varPop() and varSamp()

stddevPop() and stddevSamp()

covarPop()

corr()

Bitmap-Based Aggregations

groupBitmap()

bitmapCardinality()

Benefits

Array-Based Aggregations

groupArray()

groupUniqArray()

groupArraySorted()

AggregatingMergeTree and Aggregate States

Benefits

Choosing the Right Aggregation Function

Conclusion

Tags

Author

Stats

Published

You Might Also Like

Too cheap to be good? Think again.

The Winner of the AI-Pocalypse? The Full-Stack Generalist (But Probably Later Instead of Sooner)

The Stale Feature Flag We Deleted That Turned a Feature Back On

Some friends wanted to see how I use DigitalOcean. So I built them the smallest real app I could.

Trust Isn't a Scalar: Typed Provenance for Agent Chains

I Made Claude Code Think Before It Codes. Then I Gave It a Team.