This article was originally published on aifoss.dev
TL;DR: vLLM-ATOM is AMD's open-source (MIT) plugin that slots AITER-accelerated kernels (fused attention, quantized GEMM, fused MoE) under vLLM on Instinct MI350/MI400 GPUs, with zero changes to your vLLM commands or API. It's a data-center play: consumer Radeon owners get nothing here yet. Worth it if you run Instinct silicon or rent it.
What you'll have running after this guide:
- A vLLM server backed by ATOM's AITER kernels on an AMD Instinct GPU
- A drop-in OpenAI-compatible endpoint (same vllm serve workflow you already use)
- A clear read on whether your hardware can use it, and what to do if it can't
| | vLLM-ATOM (plugin) | Upstream vLLM ROCm | llama.cpp ROCm |
|---|---|---|---|
| Best for | Instinct MI350/MI400 inference | Any ROCm GPU, stable path | Single-GPU / consumer Radeon |
| Kernels | AITER: fused attn, quant GEMM, fused MoE | Triton + partial AITER | HIP ports of CPU/CUDA kernels |
| Hardware focus | Instinct (MI350, MI355X, MI400) | Instinct + some Radeon | Radeon + Instinct |
| Setup | Docker image or pip plugin | pip install vllm | Compile with -DGGML_HIP=ON |
| License | MIT | Apache 2.0 | MIT |
Honest take: If you're on Instinct hardware (owned or rented), ATOM is the fastest path to AMD-native kernels without rewriting anything. If you're a home-labber on a Radeon RX card, skip it. The upstream ROCm backend or llama.cpp is your lane until these kernels get upstreamed.
What vLLM-ATOM actually is
AMD announced vLLM-ATOM on May 7, 2026 on the ROCm blog. The short version: vLLM is the de-facto open-source inference server, but its highest-performance kernels were written for NVIDIA first. ROCm support has historically lagged. ATOM is AMD's answer: a plugin that injects AMD-native kernels into vLLM without forking the project or breaking the API.
The design is three layers, and understanding them tells you exactly what ATOM does and doesn't change:
- Top layer: vLLM. Request scheduling, batching, the OpenAI-compatible server, and the compatibility interface. Untouched. Your vllm serve commands, your /v1/chat/completions calls, your sampling params: all identical.
- Middle layer: ATOM plugin. Model implementation and kernel selection. This is where ATOM swaps in optimized attention, GEMM, and MoE routing for the architectures it supports.
- Bottom layer: AITER. AMD's kernel library that talks directly to the GPU. Flash Attention, quantized GEMM, and fused MoE land here, plus custom AllReduce for multi-GPU.
Because the top layer is stock vLLM, ATOM keeps the full feature set production deployments depend on: continuous batching, prefix caching, tensor parallelism, structured output. That's the entire pitch: AMD-native speed without giving up vLLM's ergonomics.
It's MIT-licensed (the ROCm/ATOM repo), which is more permissive than vLLM's own Apache 2.0. No commercial-use asterisks.
The hardware reality (read this before you install anything)
ATOM is built for AMD Instinct data-center accelerators: MI350, MI355X (which adds FP4), and the MI400 series with rack-scale inference. The README lists "AMD GPU with ROCm support" generically, but the kernels and the shipped Docker image target Instinct. The base image is rocm/pytorch:rocm7.0.2_ubuntu24.04_py3.12_pytorch_release_2.8.0 (ROCm 7.0.2, PyTorch 2.8.0).
If you own a Radeon RX 7900 XTX or a 9070, ATOM is not aimed at you in mid-2026. The AITER kernels are tuned for CDNA Instinct, not RDNA consumer parts. You won't get a clean error that says "wrong GPU" so much as missing kernel paths and fallbacks that defeat the purpose.
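Since there is no clean "wrong GPU" error, it helps to check the architecture yourself before installing. The sketch below parses the text that rocminfo prints and classifies each gfx target as CDNA or RDNA. The lookup table is my own short, non-exhaustive list of well-known targets, not something ATOM ships, and the helper names are hypothetical. A CDNA result is necessary but not sufficient: the article's support claims only cover MI350, MI355X, and MI400.

```python
import re

# Illustrative, non-exhaustive map of LLVM gfx targets to GPU family.
# This table is an assumption of the sketch, not an ATOM support matrix.
KNOWN_TARGETS = {
    "gfx90a": ("CDNA", "Instinct MI200 series"),
    "gfx942": ("CDNA", "Instinct MI300 series"),
    "gfx950": ("CDNA", "Instinct MI350 series"),
    "gfx1100": ("RDNA", "Radeon RX 7900 XTX"),
    "gfx1201": ("RDNA", "Radeon RX 9070"),
}


def find_gfx_targets(rocminfo_text):
    """Return the unique gfx target names in rocminfo output, in first-seen order."""
    seen = []
    for name in re.findall(r"\bgfx[0-9a-f]{3,4}\b", rocminfo_text):
        if name not in seen:
            seen.append(name)
    return seen


def classify(target):
    """Return 'CDNA', 'RDNA', or 'unknown' for a gfx target name."""
    family, _example = KNOWN_TARGETS.get(target, ("unknown", ""))
    return family
```

To use it, capture the output of rocminfo (for example with subprocess.run(["rocminfo"], capture_output=True, text=True).stdout) and pass it to find_gfx_targets. If every target classifies as RDNA, the upstream ROCm backend or llama.cpp is the better path.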
For most aifoss readers the practical way to touch Instinct hardware is to rent it. An MI300X/MI350 instance on RunPod lets you test ATOM for the cost of an hour, which is the right move before committing to anything. If you're weighing a local Instinct box against cloud rental, our self-hosted vs SaaS cost breakdown covers the math, and runaihome.com has the GPU-server hardware side.
Install: the Docker path (recommended)
AMD ships a nightly dev image, and that's the least painful way in: it pins a known-good ROCm + PyTorch + AITER + ATOM combination so you're not chasing version drift.
docker pull rocm/atom-dev:latest
docker run -it --network=host \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v $HOME:/home/$USER \
-v /mnt:/mnt \
-v /data:/data \
--shm-size=16G \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
rocm/atom-dev:latest
The --device=/dev/kfd and --device=/dev/dri flags expose the AMD kernel-fusion driver and the render node β both are required for ROCm inside a container. --shm-size=16G matters: vLLM uses shared memory for tensor-parallel communication, and the default 64MB will crash multi-GPU runs.
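Both failure modes above (missing device nodes, undersized shared memory) are cheap to check before launching a multi-GPU run. Here is a minimal pre-flight sketch using only the standard library. The function names and the 16 GiB threshold are mine, chosen to mirror the docker run flags above, so adjust them to your setup.

```python
import os
import shutil

GIB = 1024 ** 3


def missing_devices(paths=("/dev/kfd", "/dev/dri")):
    """Return the device paths that do not exist in this environment."""
    return [p for p in paths if not os.path.exists(p)]


def shm_total_gib(path="/dev/shm"):
    """Total size of the filesystem mounted at `path`, in GiB."""
    return shutil.disk_usage(path).total / GIB


def preflight(min_shm_gib=16.0):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = [f"missing device node: {p}" for p in missing_devices()]
    if os.path.exists("/dev/shm"):
        size = shm_total_gib()
        if size < min_shm_gib:
            problems.append(f"/dev/shm is {size:.2f} GiB, want at least {min_shm_gib} GiB")
    else:
        problems.append("/dev/shm not found")
    return problems
```

Run preflight() inside the container. A container started without --shm-size will typically report /dev/shm at roughly 0.06 GiB, which is the 64MB default mentioned above.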
Install: the pip path (if you manage your own ROCm)
If you already have a working ROCm 7.0.x environment and don't want a container, install AITER and ATOM directly:
pip install amd-aiter
git clone https://github.com/ROCm/ATOM.git
pip install ./ATOM
ATOM ships on a bi-weekly paired-release cadence with AITER: release v0.1.4 (June 6, 2026) was paired with AITER v0.1.15. Match the versions. AMD reverted two PRs during v0.1.4 validation specifically over AITER compatibility, which tells you the pairing isn't optional cosmetic guidance: mismatched AITER and ATOM will bite you.
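If you go the pip route, a small guard can catch a mismatch before it surfaces as a confusing kernel error. This is a sketch under stated assumptions: the pairing table holds only the one pair named above, the amd-aiter distribution name comes from the pip command above, and "atom" as ATOM's installed distribution name is a guess you should confirm with pip list. Builds installed from a git checkout may also report a version string with a local suffix, which this exact-match check would flag.

```python
from importlib import metadata

# ATOM release -> paired AITER release. Only the pair named in this article is
# listed; extend it from the release notes for other versions.
PAIRED_RELEASES = {"0.1.4": "0.1.15"}


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_pairing(atom_version, aiter_version, pairs=PAIRED_RELEASES):
    """Compare an ATOM/AITER version pair against the known pairing table."""
    expected = pairs.get(atom_version)
    if expected is None:
        return f"unknown ATOM release {atom_version}: check the release notes"
    if aiter_version == expected:
        return "ok"
    return f"mismatch: ATOM {atom_version} expects AITER {expected}, found {aiter_version}"
```

A typical call would be check_pairing(installed_version("atom"), installed_version("amd-aiter")), with the first distribution name swapped for whatever pip list actually shows.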
Running a model through the ATOM backend
ATOM registers itself as an out-of-tree plugin backend for vLLM. Once installed, you select it and otherwise run vLLM exactly as you always have:
VLLM_ATTENTION_BACKEND=ATOM \
vllm serve deepseek-ai/DeepSeek-V3 \
--tensor-parallel-size 8 \
--port 8000
A successful boot looks like the normal vLLM startup, with ATOM/AITER kernels logged during init:
INFO ... Using ATOM plugin backend (AITER kernels)
INFO ... Loading model weights ...
INFO ... Started server process
INFO ... Uvicorn running on http://0.0.0.0:8000
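Loading a model the size of DeepSeek-V3 takes a while, so scripts that talk to the server should wait for it instead of sleeping for a fixed time. The sketch below polls vLLM's /health endpoint with the standard library. The default timeout and polling interval are arbitrary choices of mine, and the probe and sleep parameters exist only so the loop can be exercised without a live server.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout_s=5.0):
    """Return True if a GET on `url` answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(base_url="http://localhost:8000", timeout_s=1800.0,
                     interval_s=5.0, probe=http_ok, sleep=time.sleep):
    """Poll <base_url>/health until it succeeds or `timeout_s` elapses."""
    health_url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(health_url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

wait_until_ready() returns True once the server answers and False if the deadline passes first, so a deploy script can fail fast instead of hanging.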
From there it's a stock OpenAI endpoint:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"deepseek-ai/DeepSeek-V3","messages":[{"role":"user","content":"Say hi in 3 words"}]}'
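The same request from Python, for anyone scripting against the endpoint. This is a standard-library sketch of the curl call above. The helper names are mine, and max_tokens is an extra parameter I added to keep test replies short.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=32):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_body):
    """Extract the assistant text from a chat completions JSON body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def chat(base_url, model, prompt, timeout_s=120.0):
    """Send one prompt to the server and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return first_reply(resp.read())
```

With the server from the previous step running, chat("http://localhost:8000", "deepseek-ai/DeepSeek-V3", "Say hi in 3 words") returns the reply as a string.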
If you've followed our vLLM setup guide or vLLM production setup, nothing above is new, and that's the whole point. The Nginx, auth, and multi-model patterns from those guides carry over unchanged, because the server layer is identical.
What you actually get from AITER
The acceleration lives in three kernel families:
- Fused attention: Flash-Attention-style fused kernels tuned for CDNA, cutting memory traffic on the attention path.
- Quantized GEMM: optimized low-precision matrix multiply, including FP4 on MI355X. If you've read our GPTQ vs AWQ vs GGUF for vLLM breakdown, this is the kernel side of why 4-bit serving is fast.
- Fused MoE routing: for Mixture-of-Experts models, the expert dispatch/gather is fused instead of run as separate ops, which is where MoE inference usually bleeds time.
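To put a rough number on why low precision matters, here is the weight-memory arithmetic for a model the size of DeepSeek-V3 (671B total parameters). It is a back-of-envelope sketch: it counts weights only and ignores quantization scales, the KV cache, and activations, so real footprints are higher.

```python
def weight_gigabytes(num_params, bits_per_weight):
    """Weights-only memory in GB (10**9 bytes): params * bits / 8 bytes."""
    return num_params * bits_per_weight / 8 / 1e9


def per_gpu_gigabytes(num_params, bits_per_weight, tensor_parallel_size):
    """Even split of the weights across tensor-parallel ranks (idealized)."""
    return weight_gigabytes(num_params, bits_per_weight) / tensor_parallel_size


DEEPSEEK_V3_PARAMS = 671e9  # total parameters, all experts included

for bits in (16, 8, 4):
    total = weight_gigabytes(DEEPSEEK_V3_PARAMS, bits)
    share = per_gpu_gigabytes(DEEPSEEK_V3_PARAMS, bits, 8)
    print(f"{bits:>2}-bit: {total:7.1f} GB total, {share:6.1f} GB per GPU at TP=8")
```

Each halving of precision halves the bytes every GEMM has to pull from memory: 1342 GB at 16-bit, 671 GB at 8-bit, and 335.5 GB at 4-bit. That memory traffic is a large part of why 4-bit serving is fast.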
AMD has not published an apples-to-apples public token/s table comparing ATOM against upstream vLLM ROCm for a fixed model and batch, so I'm not going to quote a speedup number. Treat any "Nx faster" claim you see elsewhere with suspicion until there's a reproducible benchmark. What's verifiable is the kernel coverage and the architecture, not a headline multiplier.
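If you want your own number, the measurement itself is simple. The sketch below turns (completion tokens, wall-clock seconds) pairs into a small summary. It assumes you take the token count from the usage.completion_tokens field of each chat completions response and time each request yourself. This is single-stream, end-to-end throughput with prefill included, not a substitute for a proper load test, and it only means something if model, prompt, sampling settings, and hardware are identical across the two backends.

```python
import statistics


def tokens_per_second(completion_tokens, elapsed_s):
    """Generated tokens divided by wall-clock seconds for one request."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def summarize(samples):
    """Summarize a list of (completion_tokens, elapsed_s) measurements."""
    rates = [tokens_per_second(tokens, seconds) for tokens, seconds in samples]
    return {
        "runs": len(rates),
        "min": min(rates),
        "median": statistics.median(rates),
        "max": max(rates),
    }
```

Collect a few dozen samples per backend and compare the medians. A single run tells you very little, especially right after startup.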
Supported models
ATOM's model table covers dense and MoE families: Llama 2 / 3 / 3.1, Qwen3 (dense and MoE variants), DeepSeek V2/V3, and Mixtral, with vision-language models in scope too. AMD's launch materials sp









