Introduction

One of the reasons ClickHouse® delivers exceptional analytical performance is its storage engine. Instead of updating data in place, ClickHouse continuously performs background merges and mutations to keep data organized and queries efficient.

While these operations happen automatically, they can sometimes become difficult to monitor. A growing number of table parts can lead to "Too many parts" errors, while long-running mutations may delay updates and consume significant system resources.

Unlike query execution, merges and mutations do not always provide clear visibility into their progress. Administrators often rely on manually checking system tables, making it difficult to identify bottlenecks before they impact production workloads.

In this article, you'll learn how merges and mutations work, why monitoring them matters, and how to troubleshoot common issues.

Understanding Background Merges

ClickHouse stores data in immutable parts.

Each INSERT creates at least one new part (one for every partition the inserted rows belong to). Over time, hundreds or even thousands of parts can accumulate.

To improve query performance and reduce storage overhead, ClickHouse automatically merges smaller parts into larger ones.

A typical merge process looks like this:

Insert Data
     │
     ▼
Part A   Part B   Part C
   │       │       │
   └───────┴───────┘
          ▼
     Background Merge
          ▼
      Larger Part
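The diagram above can be sketched as a toy model. This is only an illustration, not ClickHouse's actual merge implementation: it assumes each part is a small in-memory list of rows already sorted by a sorting key, and the names merge_parts, part_a, part_b, and part_c are hypothetical.

```python
import heapq

def merge_parts(parts):
    """Merge several sorted parts into one larger sorted part."""
    # Each part is a list of (sort_key, value) rows already ordered by
    # sort_key, so a k-way merge keeps the result sorted without
    # re-sorting every row from scratch.
    return list(heapq.merge(*parts, key=lambda row: row[0]))

# Three small parts, as if created by three separate inserts.
part_a = [(1, "a"), (4, "d")]
part_b = [(2, "b"), (5, "e")]
part_c = [(3, "c")]

larger_part = merge_parts([part_a, part_b, part_c])
print(larger_part)
```

The source parts are left untouched and a new, larger part is produced, which matches the idea that parts are immutable and merges replace several parts with one.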

Benefits of Background Merges

Fewer files on disk
Faster query execution
Better compression
Lower metadata overhead

What Are Mutations?

A mutation modifies existing data.

Operations such as:

ALTER TABLE ... UPDATE
ALTER TABLE ... DELETE
MATERIALIZE COLUMN
MATERIALIZE INDEX

are executed as background mutations rather than immediate row updates.

Example:

ALTER TABLE events
DELETE WHERE event_date < '2024-01-01';

Instead of deleting rows instantly, ClickHouse schedules a mutation that rewrites affected parts in the background.

The time required depends on:

Table size
Number of parts
Available CPU
Disk throughput
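To make the "rewrites affected parts" idea concrete, here is a minimal sketch under simplifying assumptions: parts are modelled as Python tuples of event_date strings, and apply_delete is a hypothetical helper, not a ClickHouse API. It shows why the cost of a mutation grows with the number of parts that contain matching rows.

```python
def apply_delete(parts, should_delete):
    """Return (new_parts, rewritten_count) after a DELETE-style mutation.

    Parts are tuples because they are immutable: an affected part is
    replaced by a rewritten copy, an unaffected part is reused as-is.
    """
    new_parts = []
    rewritten = 0
    for part in parts:
        if any(should_delete(row) for row in part):
            kept = tuple(row for row in part if not should_delete(row))
            rewritten += 1
            if kept:  # a part left with no rows is dropped entirely
                new_parts.append(kept)
        else:
            new_parts.append(part)
    return new_parts, rewritten

# Hypothetical parts holding event_date values.
parts = [
    ("2023-11-30", "2023-12-15"),
    ("2023-12-31", "2024-01-02"),
    ("2024-02-01", "2024-02-10"),
]
new_parts, rewritten = apply_delete(parts, lambda event_date: event_date < "2024-01-01")
print(new_parts, rewritten)
```

Only the parts that contain matching rows are rewritten; the third part is carried over unchanged.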

Why Monitoring Merges and Mutations Matters

When background tasks fall behind, overall database performance can degrade.

Common symptoms include:

"Too many parts" errors
Slow inserts
High disk usage
Long-running ALTER operations
Increased CPU utilization
Delayed DELETE or UPDATE operations

These issues often appear gradually before becoming production incidents.

Monitoring Active Merges

ClickHouse exposes active merge operations through the system.merges table.

SELECT
    database,
    table,
    elapsed,
    progress,
    num_parts
FROM system.merges;

Example output:

database    table         progress   elapsed (s)
analytics   events        0.72       35
logs        access_logs   0.41       18

Useful columns include:

progress
elapsed
num_parts
bytes_read_uncompressed
bytes_written_uncompressed

These values help determine whether merges are progressing normally.
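One simple way to judge whether a merge is progressing normally is to extrapolate its remaining time from elapsed and progress. The sketch below assumes progress grows roughly linearly, which is only an approximation; estimate_remaining_seconds is a hypothetical helper, and the rows reuse the values from the example output above.

```python
def estimate_remaining_seconds(elapsed, progress):
    """Linearly extrapolate the remaining time of a merge.

    elapsed is in seconds, progress is a fraction between 0 and 1.
    Returns None when progress is 0 or out of range, since no estimate
    is possible.
    """
    if not 0 < progress <= 1:
        return None
    return elapsed * (1 - progress) / progress

# Rows shaped like the result of the system.merges query.
merges = [
    {"database": "analytics", "table": "events", "elapsed": 35.0, "progress": 0.72},
    {"database": "logs", "table": "access_logs", "elapsed": 18.0, "progress": 0.41},
]

for m in merges:
    eta = estimate_remaining_seconds(m["elapsed"], m["progress"])
    print(f'{m["database"]}.{m["table"]}: about {eta:.0f} s remaining')
```

An estimate that keeps growing between samples suggests the merge is slowing down rather than finishing.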

Monitoring Mutations

Use system.mutations to inspect mutation status.

SELECT
    database,
    table,
    mutation_id,
    command,
    is_done,
    parts_to_do
FROM system.mutations;

Example (the status shown here is derived from is_done, where 0 means running and 1 means completed):

mutation_id       status      parts_to_do
mutation_25.txt   Running     48
mutation_26.txt   Completed   0

Important fields include:

is_done
parts_to_do
latest_failed_part
latest_fail_reason

These columns quickly reveal stalled or failed mutations.
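A monitoring script might turn those fields into a simple status. The following is a minimal sketch, assuming rows have already been fetched from system.mutations as dictionaries; classify_mutation is a hypothetical helper, and the third sample row (including its error text) is invented for illustration.

```python
def classify_mutation(row):
    """Map a system.mutations row to 'done', 'failing', or 'running'."""
    if row["is_done"]:
        return "done"
    # A non-empty latest_fail_reason means at least one part failed to mutate.
    if row.get("latest_fail_reason"):
        return "failing"
    return "running"

# The first two rows mirror the example above; the third is hypothetical.
mutations = [
    {"mutation_id": "mutation_25.txt", "is_done": 0, "parts_to_do": 48,
     "latest_fail_reason": ""},
    {"mutation_id": "mutation_26.txt", "is_done": 1, "parts_to_do": 0,
     "latest_fail_reason": ""},
    {"mutation_id": "mutation_27.txt", "is_done": 0, "parts_to_do": 12,
     "latest_fail_reason": "hypothetical error text"},
]

needs_attention = [
    m["mutation_id"] for m in mutations if classify_mutation(m) == "failing"
]
print(needs_attention)
```

A "running" mutation whose parts_to_do does not decrease between checks is also worth investigating, even without a recorded failure.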

Detecting Too Many Parts

Thousands of small parts hurt performance: every query has to open and scan more files, and ClickHouse eventually delays or rejects inserts.

Use the following query to identify tables with excessive part counts:

SELECT
    database,
    table,
    count() AS parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY parts DESC;

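If a table contains an unusually high number of parts, background merges may not be keeping up with incoming inserts. Because ClickHouse enforces its part limits per partition, also check whether the parts are concentrated in a single partition.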

Common causes include:

Very small insert batches
Heavy concurrent ingestion
Slow storage
Limited background threads
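Since the insert limits count active parts within a single partition, it can help to break the count down further. The sketch below assumes rows fetched from system.parts (WHERE active) with the database, table, and partition_id columns; partitions_over_limit is a hypothetical helper, and the limit of 3 is chosen only to keep the sample small, not as a recommended threshold.

```python
from collections import Counter

def partitions_over_limit(active_parts, limit):
    """Return ((database, table, partition_id), count) pairs above limit,
    ordered from the most to the fewest parts."""
    counts = Counter(
        (p["database"], p["table"], p["partition_id"]) for p in active_parts
    )
    over = [(key, n) for key, n in counts.items() if n > limit]
    return sorted(over, key=lambda item: item[1], reverse=True)

# Hypothetical sample: one dictionary per active part.
active_parts = (
    [{"database": "analytics", "table": "events", "partition_id": "202401"}] * 5
    + [{"database": "analytics", "table": "events", "partition_id": "202402"}] * 2
    + [{"database": "logs", "table": "access_logs", "partition_id": "202401"}] * 4
)

for key, n in partitions_over_limit(active_parts, limit=3):
    print(key, n)
```

In practice the limit would be set relative to the table's parts_to_delay_insert and parts_to_throw_insert settings, whose defaults differ between ClickHouse versions.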

Viewing Merge Queue Activity

You can also monitor merge-related activity through the system.metrics and system.events tables.

Useful metrics include:

Active merges
Background pool utilization
Pending merge tasks
Mutation queue length

These metrics are especially useful when building Grafana dashboards for production monitoring.

Common Reasons Merges Fall Behind

Small Insert Sizes

Frequent tiny inserts create excessive numbers of parts.

Instead of sending each row as its own insert:

1 row
1 row
1 row

batch rows into a single insert whenever possible:

10,000 rows

Larger batches significantly reduce merge pressure.
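On the client side, batching can be as simple as buffering rows and flushing them together. This is a minimal sketch, not tied to any particular ClickHouse client library: InsertBuffer is a hypothetical class, and the flush callback stands in for whatever function actually sends one INSERT with many rows.

```python
class InsertBuffer:
    """Collect rows and hand them to flush() in batches of max_rows."""

    def __init__(self, flush, max_rows=10_000):
        self._flush = flush        # callable that performs one bulk insert
        self._max_rows = max_rows
        self._rows = []

    def add(self, row):
        self._rows.append(row)
        if len(self._rows) >= self._max_rows:
            self.flush()

    def flush(self):
        # Send whatever is buffered; do nothing when the buffer is empty.
        if self._rows:
            self._flush(self._rows)
            self._rows = []

# Here the "insert" just records each batch so the effect is visible.
batches = []
buffer = InsertBuffer(flush=batches.append, max_rows=10_000)
for i in range(25_000):
    buffer.add((i,))
buffer.flush()  # send the final partial batch

print([len(batch) for batch in batches])
```

With this approach, 25,000 rows result in three inserts instead of 25,000. A production version would usually also flush on a timer so that rows do not wait indefinitely during quiet periods.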

Heavy Disk I/O

Merges continuously read and rewrite data parts.

If storage cannot keep up, merge throughput decreases.

SSD or NVMe storage greatly improves merge performance.

Large Mutations

Large UPDATE or DELETE operations require many parts to be rewritten.

Whenever possible, perform large mutations in smaller batches to reduce resource consumption.
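One way to split a large DELETE is by date range. The sketch below generates one ALTER TABLE ... DELETE statement per calendar month; batched_deletes and month_starts are hypothetical helpers, and the approach assumes the table is partitioned by month on the filtered column so that each statement touches a limited set of parts. It only builds the statements and does not execute them.

```python
from datetime import date

def month_starts(start, end):
    """Yield the first day of each month from start's month up to end."""
    current = date(start.year, start.month, 1)
    while current < end:
        yield current
        # Advance to the first day of the next month.
        current = date(current.year + current.month // 12,
                       current.month % 12 + 1, 1)

def batched_deletes(table, column, start, end):
    """Build one DELETE mutation per month covering [start, end)."""
    if start >= end:
        return []
    bounds = list(month_starts(start, end)) + [end]
    bounds[0] = start  # the first range begins at start, not the month start
    return [
        f"ALTER TABLE {table} DELETE WHERE {column} >= '{lo}' AND {column} < '{hi}';"
        for lo, hi in zip(bounds, bounds[1:])
    ]

statements = batched_deletes("events", "event_date",
                             date(2023, 10, 1), date(2024, 1, 1))
for statement in statements:
    print(statement)
```

Running the statements one at a time, and waiting for each mutation to show is_done = 1 in system.mutations before submitting the next, keeps the amount of concurrent rewrite work bounded.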

Background Thread Saturation

ClickHouse performs merges using background worker threads.

If these workers remain fully occupied, merge tasks begin to queue.

Monitoring background thread utilization helps detect this issue early.
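Utilization can be computed as busy tasks divided by pool size. The sketch below assumes the values come from system.metrics; the metric names BackgroundMergesAndMutationsPoolTask and BackgroundMergesAndMutationsPoolSize are the ones found in recent ClickHouse releases, but they may differ in your version, so verify them against your own system.metrics. The helper pool_utilization and the sample values are hypothetical.

```python
def pool_utilization(metrics, task_metric, size_metric):
    """Return the busy fraction of a background pool, or None if unknown."""
    size = metrics.get(size_metric, 0)
    if size <= 0:
        return None  # metric missing or pool size not reported
    return metrics.get(task_metric, 0) / size

# Hypothetical snapshot of {metric: value} pairs from system.metrics.
metrics = {
    "BackgroundMergesAndMutationsPoolTask": 12,
    "BackgroundMergesAndMutationsPoolSize": 16,
}

utilization = pool_utilization(
    metrics,
    task_metric="BackgroundMergesAndMutationsPoolTask",
    size_metric="BackgroundMergesAndMutationsPoolSize",
)
print(f"{utilization:.0%} of merge/mutation workers busy")
```

A value that stays close to 100% across many samples indicates that new merges and mutations are likely waiting for a free worker.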

Best Practices

To keep merges and mutations healthy:

Batch inserts instead of inserting individual rows.
Monitor active part counts regularly.
Watch for long-running mutations.
Use fast storage for production workloads.
Avoid frequent large DELETE operations.
Schedule heavy maintenance during off-peak hours.
Build dashboards around system tables for continuous monitoring.

Example Monitoring Queries

Largest tables by part count

SELECT
    database,
    table,
    count() AS parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY parts DESC
LIMIT 10;

Running merges

SELECT *
FROM system.merges;

Pending mutations

SELECT *
FROM system.mutations
WHERE is_done = 0;

Building Dashboards

Many production teams collect data from:

system.merges
system.mutations
system.parts
system.metrics
system.events

A monitoring dashboard can display:

Active merges
Pending mutations
Part count per table
Merge throughput
Background thread utilization

Continuous visibility into these metrics helps identify issues before they affect production workloads.
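Merge throughput is usually derived from a cumulative counter sampled at two points in time. The sketch below assumes a counter such as MergedUncompressedBytes from system.events, which accumulates from server start; rate_per_second is a hypothetical helper, and the sample values are invented.

```python
def rate_per_second(earlier, later):
    """Compute a per-second rate from two (timestamp, counter) samples.

    Returns None if the samples are out of order or the counter went
    backwards, which typically means the server restarted in between.
    """
    (t0, v0), (t1, v1) = earlier, later
    if t1 <= t0 or v1 < v0:
        return None
    return (v1 - v0) / (t1 - t0)

# Hypothetical samples taken 60 seconds apart: (unix_time, counter_value).
sample_a = (1_700_000_000, 1_000_000)
sample_b = (1_700_000_060, 7_000_000)

throughput = rate_per_second(sample_a, sample_b)
print(f"merge throughput: {throughput:,.0f} bytes/s")
```

Tools such as Grafana typically perform this rate calculation themselves once the raw counter is collected, so the helper is mainly useful for ad hoc scripts.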

Conclusion

Background merges and mutations are essential to maintaining ClickHouse performance, but they often operate behind the scenes. Without proper monitoring, issues such as excessive part counts, stalled mutations, and merge backlogs can quietly grow until they affect ingestion and query performance.

By regularly inspecting system.merges, system.mutations, and system.parts, batching inserts efficiently, and monitoring background activity with dashboards, you can detect problems early and keep your ClickHouse cluster running smoothly.

Understanding these internal processes is a key step toward operating ClickHouse reliably at scale.

Read the full article -> https://www.quantrail-data.com/merges-and-mutations-in-clickhouse