Advanced ClickHouse® Aggregating Functions
Introduction
Aggregation is one of the key capabilities that makes ClickHouse® an exceptional analytical database. While functions like SUM(), COUNT(), and AVG() are commonly used, modern analytics often require far more advanced aggregation techniques.
Whether you're building observability platforms, financial dashboards, recommendation engines, or large-scale event analytics systems, advanced aggregate functions help you process billions of records efficiently while minimizing memory usage and computation time.
In this article, we'll explore ClickHouse®'s advanced aggregation functions, understand how they work, and learn when to use each one.
Understanding Aggregate Function States
One of the most powerful concepts in ClickHouse® is aggregate states.
Unlike traditional databases that perform aggregation in a single step, ClickHouse® separates aggregation into two phases:
- Building the aggregation state
- Finalizing the aggregation result
This behavior is exposed through aggregate combinators such as:
sumState()
sumMerge()
avgState()
avgMerge()
uniqState()
uniqMerge()
Example:
SELECT
uniqState(user_id) AS state
FROM events;
Instead of returning the final distinct count, this query produces an intermediate aggregation state.
The state can later be merged:
SELECT
uniqMerge(state)
FROM aggregated_events;
This design enables several powerful features, including:
- Materialized Views
- AggregatingMergeTree
- Incremental aggregations
- Distributed query processing
Cardinality Estimation Functions
Counting distinct values becomes increasingly expensive as datasets grow.
A query like:
SELECT count(DISTINCT user_id)
FROM events;
requires maintaining a large in-memory hash set, making it inefficient for massive datasets.
ClickHouse® offers several optimized functions for cardinality estimation.
uniq()
SELECT uniq(user_id)
FROM events;
Features
- Approximate result
- Extremely fast
- Low memory usage
- Recommended for most analytical workloads
uniq() offers an excellent balance between speed, accuracy, and resource consumption.
uniqExact()
SELECT uniqExact(user_id)
FROM events;
Features
- Exact result
- Higher memory usage
- Slower execution
Use it when precision is critical, such as:
- Financial reporting
- Compliance
- Auditing
Avoid it for extremely large high-cardinality datasets unless exact counts are mandatory.
uniqCombined()
SELECT uniqCombined(user_id)
FROM events;
Features
- Hybrid algorithm
- High accuracy
- Better scalability
- Lower memory usage than exact counting
Ideal for very large datasets where accuracy remains important.
uniqHLL12()
SELECT uniqHLL12(user_id)
FROM events;
This function uses the HyperLogLog algorithm.
Features
- Fixed memory usage
- Approximate results
- Scales efficiently
A good choice when predictable memory consumption is more important than maximum accuracy.
uniqTheta()
SELECT uniqTheta(user_id)
FROM events;
Based on Theta Sketches, this function provides:
- Efficient sketch merging
- Excellent distributed performance
- High scalability
It is commonly used in distributed analytics environments.
Quantile Functions
Average values rarely tell the complete story.
Consider this query:
SELECT avg(response_time_ms)
FROM requests;
An average response time of 100 ms may still hide occasional requests taking several seconds.
Percentile calculations help reveal these latency spikes.
quantile()
SELECT quantile(0.95)(response_time_ms)
FROM requests;
Returns the 95th percentile, meaning that 95% of requests complete within this value.
quantiles()
Multiple percentiles can be calculated simultaneously.
SELECT
quantiles(0.5, 0.9, 0.95, 0.99)(response_time_ms)
FROM requests;
This returns:
- Median (P50)
- P90
- P95
- P99
Computing them together is more efficient than separate queries.
quantileExact()
SELECT
quantileExact(0.99)(response_time_ms)
FROM requests;
Features
- Exact percentile
- Higher memory usage
- More computationally expensive
Suitable for workloads requiring precise percentile calculations.
quantileTDigest()
SELECT
quantileTDigest(0.99)(response_time_ms)
FROM requests;
Uses the T-Digest algorithm.
Benefits
- Excellent tail latency estimation
- Efficient memory usage
- High accuracy
Widely used in:
- API monitoring
- Infrastructure observability
- Database performance monitoring
quantileTiming()
SELECT
quantileTiming(0.99)(response_time_ms)
FROM requests;
Optimized specifically for timing distributions.
Ideal for measuring:
- API response times
- Service latency
- Request durations
Top-K Analysis
Many analytical workloads require identifying the most frequent values.
Examples include:
- Most viewed products
- Most searched keywords
- Most active users
- Most common error codes
topK()
SELECT
topK(10)(product_id)
FROM orders;
Returns the ten most frequent products.
Features
- Approximate
- Very efficient
- Suitable for streaming-scale workloads
Instead of sorting every unique value, ClickHouse® maintains a compact data structure that tracks the most frequent elements.
topKWeighted()
SELECT
topKWeighted(10)(product_id, quantity)
FROM sales;
This version considers weights during aggregation.
Useful for ranking based on:
- Revenue
- Purchase quantity
- Weighted popularity
Statistical Aggregation Functions
ClickHouse® also includes built-in statistical functions.
varPop() and varSamp()
Population variance:
SELECT varPop(price)
FROM products;
Sample variance:
SELECT varSamp(price)
FROM products;
stddevPop() and stddevSamp()
Population standard deviation:
SELECT stddevPop(price)
FROM products;
Sample standard deviation:
SELECT stddevSamp(price)
FROM products;
Useful for:
- Risk analysis
- Anomaly detection
- Data quality monitoring
covarPop()
SELECT
covarPop(x, y)
FROM measurements;
Calculates covariance between two variables.
corr()
SELECT
corr(x, y)
FROM measurements;
Returns the Pearson correlation coefficient ranging from -1 to +1.
Commonly used for:
- Exploratory analytics
- Feature engineering
- Statistical analysis
Bitmap-Based Aggregations
For extremely large user sets, bitmap aggregation provides an efficient alternative to traditional distinct counting.
groupBitmap()
SELECT
groupBitmap(user_id)
FROM events;
Creates a bitmap representation of users.
bitmapCardinality()
SELECT
bitmapCardinality(bitmap_column)
FROM users;
Returns the number of distinct values stored in the bitmap.
Benefits
- Compact storage
- Fast set operations
- Efficient merging
Frequently used in:
- User segmentation
- Audience analytics
- Advertising platforms
Array-Based Aggregations
Sometimes analytical queries require collecting values rather than producing a single aggregate.
groupArray()
SELECT
user_id,
groupArray(page)
FROM visits
GROUP BY user_id;
Example output:
["home","pricing","checkout"]
Useful for user journey and clickstream analysis.
groupUniqArray()
SELECT
groupUniqArray(page)
FROM visits;
Collects only unique values while aggregating.
groupArraySorted()
SELECT
groupArraySorted(10)(score)
FROM results;
Returns sorted aggregated values.
AggregatingMergeTree and Aggregate States
One of the most powerful production patterns combines:
- Aggregate states
- Materialized Views
- AggregatingMergeTree
Example table:
CREATE TABLE daily_users
(
date Date,
users AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree()
ORDER BY date;
Materialized View:
CREATE MATERIALIZED VIEW mv_daily_users
TO daily_users
AS
SELECT
toDate(event_time) AS date,
uniqState(user_id) AS users
FROM events
GROUP BY date;
Querying the aggregated data:
SELECT
date,
uniqMerge(users)
FROM daily_users
GROUP BY date;
Benefits
- Faster analytical queries
- Lower CPU usage
- Incremental aggregation
- Better scalability
This architecture is widely adopted in production ClickHouse® deployments.
Choosing the Right Aggregation Function
| Requirement | Recommended Function |
|---|---|
| Fast distinct counting | uniq() |
| Exact distinct counting | uniqExact() |
| Massive cardinality | uniqCombined() |
| Fixed-memory cardinality | uniqHLL12() |
| Approximate percentile | quantile() |
| High-quality percentile estimation | quantileTDigest() |
| Exact percentile | quantileExact() |
| Most frequent values | topK() |
| User segmentation | groupBitmap() |
| Incremental aggregation |
State() + Merge() functions |
Conclusion
Advanced aggregation functions are one of the primary reasons ClickHouse® delivers exceptional analytical performance at scale.
The key concepts to master include:
- Aggregate states using
State()andMerge() - Approximate distinct counting with the
uniqfamily - Percentile analysis using
quantile()andquantileTDigest() - Heavy-hitter detection with
topK() - Bitmap-based aggregations for large user sets
- Incremental aggregation using
AggregatingMergeTree
Choosing the appropriate aggregation function for your workload often has a greater impact on performance, scalability, and resource usage than simply adding more hardware. By understanding these advanced capabilities, you can build faster, more efficient analytical systems with ClickHouse®.













