Where tensor-parallel inference hits the NVLink wall

2026-05-31 · GPU / distributed systems

Tensor parallelism splits each layer across GPUs, so every forward pass pays for an
all-reduce over the network fabric. On a single node that fabric is NVLink/NVSwitch — and
how close you get to its theoretical budget decides whether TP helps or hurts. This post
measures it on 4× H100 and explains where the wall is.

Repo with the full harness and CSVs:
nccl-collectives-bench.

What was measured

A bandwidth sweep (message size 8 B → 8 GB) of the three collectives that bound distributed
LLM work — all-reduce, all-gather, reduce-scatter — driving the canonical
nvidia/nccl-tests and adding a parser + analysis layer on top. The headline number:

All-reduce bus bandwidth ≈ 366 GB/s, about 77 % of the per-GPU NVLink uni-directional budget on this box. That 77 % is the practical ceiling TP communication runs into; the remaining gap is protocol overhead and the algorithm's traffic multiplier.
Algorithm ranking at large messages: NVLS > Ring > Tree. NVLink SHARP (NVLS) offloads the reduction into the switch, which is why it pulls ahead once messages are big enough to amortise setup.
A protocol study (Simple / LL / LL128) showing the small-message latency floor — the regime that actually matters for decode, where each token's all-reduce is tiny.

Why it matters for inference

Training all-reduces gradients on big tensors, so it lives in the bandwidth-bound regime
where 366 GB/s is good news. Decode is the opposite: one token at a time means small
messages, so you're pinned against the latency floor, not the bandwidth ceiling. That is the
real "TP wall" — past a certain TP degree, the per-token all-reduce latency dominates and
adding GPUs makes decode slower, not faster.

The repo also includes an eager-vs-CUDA-Graph comparison of that decode latency wall:
capturing the per-token step as a graph removes launch overhead that would otherwise be
indistinguishable from communication cost — a reminder to measure the right thing before
blaming the fabric.

Takeaway

"Use tensor parallelism" is not free advice. Measure the all-reduce on your fabric, know
your 77 %, and know that the number that decides decode latency is the small-message floor —
not the big-message bandwidth everyone quotes.

→ Methodology, raw CSVs, and the roofline analysis:
github.com/waynehacking8/nccl-collectives-bench

Where Tensor-Parallel Inference Hits the NVLink Wall

Where tensor-parallel inference hits the NVLink wall

What was measured

Why it matters for inference

Takeaway

Tags

Author

Stats

Published

You Might Also Like

Bypassing the OS to Run LLMs: What I Learned Building a Firmware-Centric Runtime

I Built Flash Attention From Scratch — Here's What Nobody Tells You About It

ComfyUI 'Torch not compiled with CUDA enabled'? Every Fix That Works on Windows, Linux, and Mac (2026)

CUDA Out of Memory on Local AI? Every Fix That Works for Ollama, llama.cpp, ComfyUI, and vLLM (2026)