Scaling Up: Doubling the RAM, Killing the Tail, and Finding the Real Ceiling

Part 3 of 3 — How We Got MongoDB to 5,000 Strict-ACID TPS at p99 ≤ 20 ms

The short version: 64 GB got client-bulk to a 9.8 ms p99, but the per-op variant
still passed only 3 of 5 runs. Doubling the RAM to 128 GB eliminated the
checkpoint tail: per-op p99 max dropped ~30×, both variants passed 5/5, and the
result became essentially deterministic. Then we pushed past 5,000 TPS to find
where it breaks, and found the one code change that moves that ceiling.

The itch that 64 GB left

End of Part 2: client-bulk was a clean 9.8 ms, 5/5. But the per-op variant
(four ordinary writes, no batching) was passing only 3 of 5 runs. Its typical
run was rock-solid (median p99 ~15 ms), but the worst run spiked to 346 ms. The
checkpoint tail we'd mostly tamed was still occasionally catching the variant
that issues 4× as many write operations per transaction.

At 64 GB we'd run out of road: gp3 was already maxed (Part 2), client-bulk passed,
but per-op's tail still occasionally landed on a checkpoint burst. The one input we hadn't increased was memory, and on a 100M-document dataset, memory decides how much of the data is served from cache versus fetched from disk.

That gave us a clear, testable hypothesis. By default WiredTiger sizes its cache at 50% of (RAM − 1 GB), so doubling RAM roughly doubles the default cache. A bigger cache serves more reads from memory, which frees up disk I/O the checkpoint flushes had been competing for. More RAM wouldn't make the disk one byte faster; the bet was that it would leave more of the disk free for the checkpoint bursts that were producing the tail. The only way to know was to try it.
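
The sizing rule is easy to check by hand. Here is a minimal sketch of MongoDB's documented default (the larger of 50% of (RAM − 1 GB) or 256 MB). The function name is ours, and the inputs are nominal instance sizes: a real node can report slightly less usable memory than its nominal size, so an observed default may come in a little under these figures.

```python
def default_wiredtiger_cache_gb(ram_gb: float) -> float:
    """Default WiredTiger cache size: max(50% of (RAM - 1 GB), 256 MB)."""
    return max(0.5 * (ram_gb - 1.0), 0.25)


# Nominal RAM sizes of the two instance classes discussed in this post.
for ram_gb in (64, 128):
    cache_gb = default_wiredtiger_cache_gb(ram_gb)
    print(f"{ram_gb} GB RAM -> {cache_gb:.1f} GB default cache")
# 64 GB RAM -> 31.5 GB default cache
# 128 GB RAM -> 63.5 GB default cache
```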

The upgrade

Same vCPU count (16), double the RAM: 64 → 128 GB (the m7i.4xlarge →
r7i.4xlarge class). We let WiredTiger pick its default cache (~61 GB) rather
than the hand-tuned 50 GB. Rolled one node at a time — EBS volumes persist across
an instance swap, so no reseed, no downtime beyond a rolling election.

The result wasn't a marginal improvement. It was the tail simply vanishing:

Metric	64 GB RAM / 50 GB cache	128 GB RAM / ~61 GB cache	Δ
client-bulk p99 median	9.80 ms	6.94 ms	−29%
client-bulk p99 stddev	2.41 ms	0.11 ms	22× tighter
per-op pass rate	3/5	5/5	+2
per-op p99 median	15.40 ms	7.53 ms	−51%
per-op p99 max	346 ms	11.7 ms	~30× lower
per-op p99 stddev	156 ms	1.86 ms	~84× tighter

Both variants now passed 5/5 with every run's p99 under 12 ms, and client-bulk's
p99 had a standard deviation of just 0.11 ms across five runs: for practical
purposes, deterministic. The extra cache gave dirty pages room to breathe between
checkpoints, so the flush bursts stopped colliding with the steady write stream.
The tail didn't shrink; it disappeared.
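
For readers reproducing the table, this is roughly how per-variant summaries like these can be computed from five per-run p99 values. It is a sketch under two assumptions: a run "passes" when its p99 is at or below the 20 ms target from the series title, and the spread is a sample standard deviation. The input values are made up for illustration; they are not our measured runs.

```python
import statistics

SLA_MS = 20.0  # p99 target from the series title


def summarize_runs(p99_ms: list[float], sla_ms: float = SLA_MS) -> dict:
    """Summarize per-run p99 latencies: pass rate, median, max, sample stddev."""
    passed = sum(1 for v in p99_ms if v <= sla_ms)
    return {
        "pass_rate": f"{passed}/{len(p99_ms)}",
        "median": statistics.median(p99_ms),
        "max": max(p99_ms),
        "stddev": statistics.stdev(p99_ms),  # sample standard deviation (n - 1)
    }


# Hypothetical p99 values (ms), for illustration only.
example_runs = [7.0, 7.5, 8.0, 9.0, 25.0]
print(summarize_runs(example_runs))
```

A single bad run barely moves the median but dominates both the max and the standard deviation, which is why those two columns are the ones to watch for a checkpoint tail.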

Architect takeaway: RAM isn't just working-set caching. On a write-heavy
workload, cache headroom directly governs checkpoint behavior — and therefore the
tail. If your p99 max is wild but your median is fine, look at cache-vs-dirty
headroom before you blame the query.
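
One way to look at that headroom is through the WiredTiger cache counters that serverStatus exposes. The sketch below reads three of those counters from a serverStatus document and turns them into ratios. The helper name and the sample numbers are ours and purely illustrative; on a live node the input would come from something like client.admin.command("serverStatus") in PyMongo.

```python
def cache_headroom(server_status: dict) -> dict:
    """Compute cache fill and dirty ratios from a serverStatus document."""
    cache = server_status["wiredTiger"]["cache"]
    max_bytes = cache["maximum bytes configured"]
    return {
        "fill_ratio": cache["bytes currently in the cache"] / max_bytes,
        "dirty_ratio": cache["tracked dirty bytes in the cache"] / max_bytes,
    }


# Made-up sample shaped like the relevant part of serverStatus output.
sample_status = {
    "wiredTiger": {
        "cache": {
            "maximum bytes configured": 1000,
            "bytes currently in the cache": 800,
            "tracked dirty bytes in the cache": 50,
        }
    }
}
print(cache_headroom(sample_status))
```

Sampling these ratios over a run and lining them up against latency spikes is a cheap first check before reaching for deeper profiling.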

Now break it on purpose: the scaling study

Passing the SLA is table stakes. An architect wants to know the headroom — how
far past the target before it falls over. So we pushed the same cluster past 5,000
TPS with longer sustained runs:

Target TPS	Per-op p99	ClientBulkWrite p99
5,000	~7.5 ms ✅	~9.3 ms ✅
6,000	~24 ms ❌	~10 ms ✅
7,000	~640 ms ❌	~16 ms ✅

Per-op tops out right around its design point of 5K. Client-bulk holds the SLA all
the way to 7,000 TPS on the very same hardware — a 2.4× latency advantage at
6K, and ~39× at 7K.
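
The headroom question reduces to "what is the highest tested step that still meets the SLA?" A small sketch using the approximate p99 values from the table above and the 20 ms target; the function and variable names are ours.

```python
SLA_MS = 20.0  # p99 target from the series title

# Approximate p99 values (ms) from the scaling table, keyed by target TPS.
RESULTS = {
    "per-op": {5000: 7.5, 6000: 24.0, 7000: 640.0},
    "client-bulk": {5000: 9.3, 6000: 10.0, 7000: 16.0},
}


def highest_passing_tps(p99_by_tps: dict, sla_ms: float = SLA_MS):
    """Return the highest tested TPS reached before the first SLA breach."""
    best = None
    for tps in sorted(p99_by_tps):
        if p99_by_tps[tps] > sla_ms:
            break  # stop at the first failing step
        best = tps
    return best


for variant, table in RESULTS.items():
    print(variant, highest_passing_tps(table))
```

This only reports the last passing step we tested; the true ceiling lies somewhere between that step and the next one.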

The reason is round trips. The per-op transaction makes four separate writes; at
6,000 TPS that's ~24,000 write operations/second hitting the path. ClientBulkWrite
(MongoDB's client-level bulk write, which spans multiple collections in one
command) collapses the four into a single round trip — so 7,000 TPS is ~7,000
commands/second on the write path instead of ~28,000. Same transaction, same ACID
guarantees, one quarter of the operations.
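
To make the round-trip difference concrete, here is a hedged sketch of the client-bulk shape in PyMongo. It assumes PyMongo 4.9+ and a MongoDB 8.0+ server, which is where MongoClient.bulk_write is available. The namespaces, field names, and helper names are hypothetical stand-ins, not our production schema, and write concern is assumed to be configured on the client or the transaction.

```python
from typing import Any


def build_models(txn: dict) -> list:
    """Build the four writes of one transaction (hypothetical schema)."""
    from pymongo import InsertOne, UpdateOne  # imported lazily; needs PyMongo 4.9+

    return [
        InsertOne(namespace="bank.ledger",
                  document={"txn_id": txn["id"], "amount": txn["amount"]}),
        UpdateOne(namespace="bank.accounts",
                  filter={"_id": txn["from"]},
                  update={"$inc": {"balance": -txn["amount"]}}),
        UpdateOne(namespace="bank.accounts",
                  filter={"_id": txn["to"]},
                  update={"$inc": {"balance": txn["amount"]}}),
        InsertOne(namespace="bank.audit", document={"txn_id": txn["id"]}),
    ]


def commit_in_one_command(client: Any, models: list) -> Any:
    """Send all writes as a single client-level bulk write inside one transaction."""
    with client.start_session() as session:
        with session.start_transaction():
            # One command carries every write, across collections.
            return client.bulk_write(models, session=session)
```

The per-op variant would instead call insert_one or update_one four times inside the same transaction, so each transaction costs four write commands rather than one.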

For your capacity plan: if you expect to grow past your design TPS,
ClientBulkWrite is the highest-leverage change you can make. It took us from
"breaks at 6K" to "passes at 7K" with no new hardware and zero change to the
guarantees. Reach for it before you reach for bigger nodes or a shard.

The ceiling — and what's beyond it

Even client-bulk eventually meets the same gp3 wall from Part 2: at the top of the
range, the checkpoint bursts brush the volume's maximum again. We deliberately
stopped at the gp3 ceiling because that's the honest edge of this configuration.
Beyond it, three known levers — in rough cost order:

1. WiredTiger checkpoint tuning (wiredTigerEngineRuntimeConfig): more frequent, smaller flushes. Free; just config.
2. io2 storage: up to 64K IOPS, enough to absorb the checkpoint bursts that exceed gp3's 16K ceiling. ~4× the storage cost.
3. Sharding: distributes checkpoint timing across multiple primaries, so no single write path is the bottleneck. The answer when one replica set genuinely isn't enough.

The whole journey, in one view

Stage	p99	Pass
100M ledger seeded (the wall)	~234 ms	0/5
Storage provisioning maxed (gp3 16K IOPS / 1000 MB/s)	~218 ms	0/5
Right-size ledger doc (400 → ~110 fields, ~7 KB → ~2.5 KB) → stabilized, 64 GB	~9.8 ms	5/5 (per-op 3/5)
128 GB RAM, default ~61 GB cache	~6.9 ms	5/5 (per-op 5/5)
Headroom (client-bulk)	≤ 16 ms	holds to 7,000 TPS

What this proves

Strict, fully-durable, multi-document ACID at 5,000 TPS and single-digit-ms p99 is
real — but getting there was an engineering problem with an ordered set of
levers, not a feature you switch on:

1. Diagnose with the database's own telemetry (FTDC): it was checkpoint I/O, not CPU or RAM.
2. Benchmark the real document: our test document was ~3× the production size; matching the actual ~110-field ledger schema produced the single biggest p99 drop.
3. Provision storage to match the bursts, and know the ceiling of your volume type.
4. Size RAM for cache headroom: that's what kills the tail on write-heavy work.
5. Batch round trips (ClientBulkWrite): that's what raises the throughput ceiling.

None of these is exotic. All of them are reproducible. And the binding constraint,
at every stage, was the storage-and-memory path, never correctness or the
transaction model itself.

This concludes the three-part journey. Companion code: sanitized, runnable benchmark harness.