Part 1 of 3 — How We Got MongoDB to 5,000 Strict-ACID TPS at p99 20 ms
The short version: A strict-ACID payments workload sailed through our
latency target on a fresh cluster. Then we loaded the data volume a real
production system carries — 100 million ledger rows — and p99 jumped from
single-digit milliseconds to 234 ms, failing every run. Nothing about the
transaction changed. This is the story of why — and how we drove it back down.
The mission
One card authorization, fully ACID — four operations that must all commit or all
roll back:
- check the card balance and debit it (only if funds are sufficient),
- insert an append-only ledger record,
- increment the cardholder's rolling counter,
- increment the merchant's rolling counter.
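To make the all-or-nothing contract concrete, here is a toy in-memory model of the four operations. It is a sketch of the semantics only: the `Store` type, its field names, and the integer minor-unit amounts are hypothetical, it is not safe for concurrent use, and it says nothing about how MongoDB implements transactions.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInsufficientFunds is returned when the conditional debit cannot apply.
var ErrInsufficientFunds = errors.New("insufficient funds")

// LedgerEntry is a deliberately tiny stand-in for the real ledger record.
type LedgerEntry struct {
	CardID     string
	MerchantID string
	Amount     int64 // minor units (hypothetical choice for this sketch)
}

// Store is a hypothetical in-memory model of the four pieces of state.
type Store struct {
	Balance    map[string]int64
	Ledger     []LedgerEntry
	CardCount  map[string]int64
	MerchCount map[string]int64
}

// NewStore returns an empty store with all maps initialized.
func NewStore() *Store {
	return &Store{
		Balance:    map[string]int64{},
		CardCount:  map[string]int64{},
		MerchCount: map[string]int64{},
	}
}

// Authorize applies all four operations, or none of them.
func (s *Store) Authorize(cardID, merchantID string, amount int64) error {
	// 1. Conditional debit: refuse before any state has been touched.
	if s.Balance[cardID] < amount {
		return ErrInsufficientFunds
	}
	s.Balance[cardID] -= amount
	// 2. Append-only ledger record.
	s.Ledger = append(s.Ledger, LedgerEntry{cardID, merchantID, amount})
	// 3 and 4. Rolling counters for the cardholder and the merchant.
	s.CardCount[cardID]++
	s.MerchCount[merchantID]++
	return nil
}

func main() {
	s := NewStore()
	s.Balance["card-1"] = 100
	fmt.Println(s.Authorize("card-1", "m-1", 60)) // enough funds
	fmt.Println(s.Authorize("card-1", "m-1", 60)) // not enough funds left
	fmt.Println(s.Balance["card-1"], len(s.Ledger))
}
```

The second call fails the funds check and leaves the balance, the ledger, and both counters exactly as the first call left them. That is the property the real transaction has to preserve under concurrency and across node failures.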
The bar, set up front and never moved: 5,000 transactions per second,
sustained, at p99 ≤ 20 ms, with writeConcern: majority — i.e. every commit
journaled and replicated to a majority of the replica set before it's
acknowledged. No fire-and-forget. No relaxed isolation. The real thing.
What one authorization records
The ledger insert isn't a thin row — it's a rich record of the whole event. The production schema is ~110 fields, ~2.5 KB, grouped into nested sections:
{
  type, amount, currency, original_amount, status, cardId, mid, ts, // identification
  card_snapshot: { masked_pan, scheme, tier, network, bin, issuer, country, ... },
  merchant_snapshot: { brand, mcc, mcc_name, terminal, city, state, risk_tier, settlement },
  auth: { rrn, arn, stan, approval_code, auth_method, acquirer_id },
  risk: { score, rules_triggered: [ ... ], decision, engine_version },
  velocity_snapshot: { card_count_24h, card_sum_24h, merch_count_1h, merch_sum_1h },
  operational: { response_time_ms, network_path, correlation_id, trace_id },
  settlement: { batch_id, settled_at, settlement_amount, interchange_fee, net },
  three_ds: { enrolled, authenticated, eci, cavv },
  device: { device_id, ua, ip_hash, os },
  geo: { city, state, country, lat, lon },
  trace: [ { step, ts, latency, outcome, node }, ... ] // AUTH → RISK → VELOCITY → COMMIT → SETTLE
}
And it isn't only the document: the ledger carries its production indexes — _id
plus three secondaries (cardId + ts, mid + ts, rrn) — so every insert also
writes four index entries.
One thing to flag now, because it drives Part 2: our first benchmark ran a
heavier ~400-field, ~7 KB variant of this record — nearly 3× the production
schema above. That gap turns out to matter a great deal.
How that insert becomes durable
Inside the transaction, the four writes don't hit disk independently. They're
applied against a single consistent snapshot and stay invisible to everyone else until commit. Durability happens at commitTransaction, and with writeConcern: "majority" it's a two-part guarantee:
- On the primary — WiredTiger commits all four operations atomically at one commit timestamp, together with the oplog entry that describes them (the oplog is just another collection, so the data and its replication record commit as a single all-or-nothing unit). That commit is made durable by WiredTiger's write-ahead journal — the on-disk log that survives a crash until the next checkpoint folds the changes into the data files.
- Across the replica set — secondaries replicate that oplog entry, apply it, and acknowledge. Only once a majority of voting nodes have durably applied it does the primary acknowledge the commit to the application.
The second part is what matters under failure: a majority-committed write is guaranteed to be present on whatever node becomes primary next, so an election can't roll it back. It's also where much of the latency lives — a good part of the ~5.4 ms baseline below is those journal-flush and cross-node replication round trips, not the four writes themselves. Durability isn't free; majority is us choosing to pay for it, and every number in this series is measured with that bill included.
Journaling vs. checkpointing — and why the second one bites
The journal is only half of how WiredTiger keeps data safe. The journal is a continuous write-ahead log: every commit appends to it, and that's what makes an individual write durable the moment it's acknowledged. But the journal isn't the data — it's a replay log. The actual data files are brought up to date separately, in a checkpoint.
Roughly every 60 seconds, WiredTiger takes a checkpoint: it flushes all the dirty pages that have accumulated in cache out to the data files as one consistent on-disk image, then trims the journal it no longer needs. Between checkpoints, recent changes live in memory (and in the journal, for crash safety); at each checkpoint, they get written down. After a crash, recovery is "load the last checkpoint, replay the journal since then", which is why checkpoints keep restart times bounded.
Hold onto the word flush, because a checkpoint isn't a steady trickle — it's a periodic burst of write I/O layered on top of the normal write stream. When that burst is larger than the disk can absorb, every transaction unlucky enough to commit during it waits. That is the entire plot of Part 2.
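A rough back-of-envelope shows why the size of that burst matters. The sketch below multiplies the target rate by the ledger document size and the 60-second checkpoint interval, then divides by the 125 MB/s gp3 baseline throughput listed in the environment table. It is a deliberately crude assumption-laden estimate: it counts only raw ledger bytes, ignores compression, index pages, the counter and balance updates, and journal and oplog traffic, and it assumes nothing is written to the data files between checkpoints. The function names are hypothetical, and none of the outputs are measurements.

```go
package main

import "fmt"

// dirtyMBPerCheckpoint estimates the raw document bytes (in MB) that
// accumulate between two checkpoints. Assumption: only ledger inserts count.
func dirtyMBPerCheckpoint(tps, docKB, intervalSec float64) float64 {
	return tps * docKB / 1000 * intervalSec
}

// flushSeconds estimates how long a disk sustaining diskMBps needs to
// absorb dirtyMB, assuming the checkpoint gets the whole throughput budget.
func flushSeconds(dirtyMB, diskMBps float64) float64 {
	return dirtyMB / diskMBps
}

func main() {
	const (
		tps         = 5000.0 // target rate from the SLA
		intervalSec = 60.0   // approximate checkpoint interval
		diskMBps    = 125.0  // gp3 default throughput
	)
	// 2.5 KB is the production record, 7 KB the heavier first-benchmark variant.
	for _, docKB := range []float64{2.5, 7} {
		dirty := dirtyMBPerCheckpoint(tps, docKB, intervalSec)
		fmt.Printf("%.1f KB docs: ~%.0f MB per checkpoint, ~%.1f s to flush\n",
			docKB, dirty, flushSeconds(dirty, diskMBps))
	}
}
```

Even under these simplifications the shape is clear: the heavier record produces a burst almost three times larger, and the time the disk spends absorbing it is time during which ordinary commits compete for the same I/O.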
Four design decisions made before any code
These are the choices that separate a benchmark that means something from one that doesn't — and they're the same choices you'd make designing the real system:
- Strict ACID via withTransaction. All four operations in one multi-document transaction with majority durability. withTransaction auto-retries transient errors (write conflicts, network blips); without it, any number you report under contention is fiction.
- Bucketed merchant counters; single cardholder counter. Traffic is skewed: a small fraction of merchants take most of the volume. A hot merchant's counter would be a single contended document, which under MVCC means write-conflict storms and a wrecked p99. So each merchant's count is spread across many random sub-document buckets and summed on read. Cardholders, by contrast, are spread uniformly across a million accounts, so one counter document each is fine. This is the single most important modeling decision in the project.
- Closed-loop load with think-time. "5,000 concurrent users" is modeled as a few thousand persistent sessions, each transacting roughly twice a second — not an open firehose. Think-time is what lets high concurrency coexist with a tight latency SLA; only a few dozen transactions are ever in flight at once. Drop the think-time and you're running a breaking-point test, not an SLA test.
- Co-located driver and cluster. Same region, peered network, warm RTT well under a millisecond. Public-internet latency alone would blow a 20 ms budget on the first hop. Non-negotiable.
The starting line — the environment
| Component | Spec |
|---|---|
| Database | MongoDB Enterprise 8.3.2 |
| Topology | 3-node replica set (rs0) — 1 primary, 2 secondaries |
| Nodes | 3 × m7i.4xlarge — 16 vCPU / 64 GB each |
| Location | a single availability zone, AWS Mumbai (ap-south-1) |
| Storage | gp3 SSD, default provisioning to start (125 MB/s, 3,000 IOPS) |
| Durability | writeConcern: "majority" |
| Load driver | a dedicated c6i.8xlarge (32 vCPU), same region + VPC, warm RTT well under 1 ms |
| Harness | purpose-built Go program on the official MongoDB Go driver |
Three of these are deliberate and worth calling out:
- Same availability zone. All three nodes share one AZ. That's a benchmark choice: it holds the inter-node round trip that every majority commit waits on down to sub-millisecond. The honest tradeoff: this cluster is not AZ-fault-tolerant. A production HA deployment would spread nodes across AZs and pay a few extra milliseconds per commit for it; we're measuring the engine, so we removed that variable rather than hide it.
- A separate machine drives the load. The client runs on its own 32-vCPU instance, in the same region and VPC as the cluster (warm round trip well under a millisecond). Keeping the driver off the database nodes, but on the same fast network, means the latency we report is the database's, not contention with the load generator or the public internet.
- The harness is custom Go, not a generic tool. It uses the official MongoDB Go driver and the closed-loop, think-time model above, fanning out across thousands of goroutines. Go (garbage-collected, no GIL, cheap concurrency) keeps the client from becoming the bottleneck, so we're measuring the server, not the load tool. It runs the identical four-operation transaction two ways, as four separate operations (per-op) and as a single ClientBulkWrite, and that comparison runs through the entire series.
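The closed-loop, think-time model can be sanity-checked with Little's law (in-flight = arrival rate × time in system). The sketch below is illustrative arithmetic, not harness code: the function names are hypothetical, the figure of 2 transactions per second per session is the article's "roughly twice a second", and using the ~5.4 ms uncontended commit time as the mean latency is an assumption.

```go
package main

import "fmt"

// sessionsFor returns how many closed-loop sessions are needed to reach
// targetTPS when each session completes perSessionTPS transactions a second.
func sessionsFor(targetTPS, perSessionTPS float64) float64 {
	return targetTPS / perSessionTPS
}

// inFlight applies Little's law: concurrency = rate × mean latency.
func inFlight(tps, meanLatencyMs float64) float64 {
	return tps * meanLatencyMs / 1000
}

// thinkTimeMs is what remains of one session's cycle after the transaction.
func thinkTimeMs(perSessionTPS, meanLatencyMs float64) float64 {
	return 1000/perSessionTPS - meanLatencyMs
}

func main() {
	const (
		targetTPS     = 5000.0
		perSessionTPS = 2.0 // "roughly twice a second"
		latencyMs     = 5.4 // assumed mean, taken from the uncontended figure
	)
	fmt.Printf("sessions needed: %.0f\n", sessionsFor(targetTPS, perSessionTPS))
	fmt.Printf("in flight at %.1f ms: %.0f\n", latencyMs, inFlight(targetTPS, latencyMs))
	fmt.Printf("in flight at 20 ms: %.0f\n", inFlight(targetTPS, 20))
	fmt.Printf("think time per cycle: %.1f ms\n", thinkTimeMs(perSessionTPS, latencyMs))
}
```

Under these assumptions, 2,500 sessions are enough to offer 5,000 TPS, yet only a few dozen transactions overlap at any instant, because each session spends almost all of its half-second cycle thinking rather than waiting on the database.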
We seeded a million cards, a few thousand merchants, and the counter documents, then took a baseline.
The floor looked great. A single uncontended transaction committed in ~5.4 ms: a snapshot read plus four writes plus the majority-commit round trips. That left roughly 14 ms of headroom under the SLA. The whole game became one question: does that headroom survive 5,000 concurrent TPS against a realistically full database?
On the fresh cluster, it did — early runs cleared the bar. It's tempting to stop there; plenty of benchmarks do, which is exactly why so many of them lie.
The wall
Then we made the cluster look like production. A payments ledger isn't empty; it's the append-only history of every authorization that ever happened. So we seeded it to 100 million ledger documents — at the heavy ~7 KB shape we were testing with at the time — and ran the same workload, unchanged.
p99 went to 234 ms. Every run in the battery failed. An order of magnitude over the SLA, from changing nothing but the amount of data already on disk.
| | Fresh cluster | After seeding 100M ledger rows |
|---|---|---|
| p99 | single-digit ms | ~234 ms |
| Battery result | passing | 0 / 5 |
| What changed | — | only the on-disk data volume |
This is the moment most teams either give up ("MongoDB can't do ACID at scale") or start randomly turning knobs. We did neither. The transaction logic was identical and correct; the body of the latency distribution was still healthy. Something about carrying 100 million rows of real data was poisoning the tail, and a tail that only appears under data volume points at one place: the path between memory and disk.
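A healthy body with a poisoned tail is easy to see in a percentile calculation. The sketch below uses the nearest-rank method on a synthetic sample invented purely for illustration: 980 requests at 5 ms and 20 requests at 234 ms. These are not the benchmark's measurements, and the `percentile` helper is hypothetical, not the harness's implementation.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100).
// samples must be non-empty; the input slice is not modified.
func percentile(samples []float64, p float64) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Synthetic latencies in ms: 98% fast, 2% stalled.
	samples := make([]float64, 0, 1000)
	for i := 0; i < 980; i++ {
		samples = append(samples, 5)
	}
	for i := 0; i < 20; i++ {
		samples = append(samples, 234)
	}
	fmt.Printf("p50 = %.0f ms, p99 = %.0f ms\n",
		percentile(samples, 50), percentile(samples, 99))
}
```

With only 2% of requests stalled, the median does not move at all while p99 lands squarely on the stalled value. That is why a p99 target is so sensitive to short, periodic stalls.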
The architect's lesson, stated once: an empty-database benchmark tells you
almost nothing. The interesting failures — cache pressure, checkpoint behavior,
storage saturation — only wake up when the working set and the on-disk volume
are realistic. Always benchmark against production-scale data, or you're
certifying a system you haven't actually tested.
What's coming
Part 2 tears the wall down one change at a time — each with the p99 it bought
and the evidence that justified it — and uses MongoDB's own diagnostics (FTDC) to
prove, not guess, that the bottleneck was the storage path, not CPU or RAM.
Part 3 doubles the RAM, watches the last of the tail disappear, and pushes past
the SLA to find the real ceiling — and what lies beyond it.
Companion code: sanitized, runnable benchmark harness.