We run a GPU catalog and have built up a database of 13,566 GPUs, from the GeForce 256 (1999) to Blackwell and the MI355X (2025). At some point it got interesting to look not at "which card is faster," but at how the whole industry shifted: how much FLOPS grew, where TDP hit a wall, and who led the NVIDIA-vs-AMD race in different years.
Below is a breakdown from our own data. Two things I'll put on the table right away: the methodology (what I measured and how, where the data is noisy) and an open dataset at the end of the article. Grab it and dig in with us.
TL;DR
- Peak FP32 of the flagship grew ~400× in 19 years: 0.3 TFLOPS (GeForce 8800 GTX, 2006) → 126 TFLOPS (Blackwell, 2025). It's an almost perfectly straight line on a semi-log scale.
- TDP crept up slowly (155 → 300 W over 2006–2020), then exploded in the datacenter: 700 W (H100), 1000 W (MI325X / B200), 1400 W (MI355X, 2025).
- Yet performance per watt grew ~100×: cards "draw more," but "do far more per watt." The main driver is the process node (90 nm → 3 nm) plus architecture.
- The NVIDIA/AMD duel by peak FP32 moved in waves: AMD led in the early 2010s (GCN era) and again in 2023–24 (Instinct MI300/MI325), NVIDIA in 2016–2020 (the AI pivot) and in 2025 (Blackwell). But "raw FP32" is a misleading metric; more on that below.
Methodology
- What these TFLOPS are and why they're "theoretical." Every FP32 number in this article is the theoretical peak that vendors compute with the formula:
FP32 TFLOPS = shader ALUs (CUDA cores) × boost clock (Hz) × 2 / 10^12
The ×2 is because an FMA (fused multiply-add) does a multiply and an add in one cycle, which counts as two operations. This is a ceiling, not real-world throughput. In practice you reach noticeably less, typically 60–90% on well-optimized compute-bound kernels and a fraction of that on memory-bound ones: memory bandwidth, SM occupancy, and instruction mix all get in the way, and boost clocks don't hold under sustained load and thermal limits. Theory diverging from practice is normal. The theoretical peak is valuable for a different reason: it's computed by one formula across every card and generation, so it's a fair, comparable yardstick for a historical look. It's what spec sheets list, and it's what we use. Real performance is measured with benchmarks (they're a separate table in the dataset).
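As a quick sanity check, the formula fits in a few lines of Python. This is a minimal sketch: the function name and its parameters are mine, and the V100 inputs (5120 CUDA cores, 1530 MHz boost) are public reference specs rather than rows pulled from the dataset.

```python
def peak_fp32_tflops(shader_units: int, boost_clock_mhz: float) -> float:
    """Theoretical peak FP32 in TFLOPS: units x clock x 2 ops per FMA."""
    boost_clock_hz = boost_clock_mhz * 1e6
    return shader_units * boost_clock_hz * 2 / 1e12


# Tesla V100 (SXM2) reference specs, used here only as an illustration.
v100 = peak_fp32_tflops(shader_units=5120, boost_clock_mhz=1530)
print(f"Tesla V100: {v100:.1f} TFLOPS")
```

The result rounds to the 15.7 TFLOPS listed for the V100 in the flagship table, which is the point: the spec-sheet number is pure arithmetic on two fields, not a measurement.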
- The source is our specification database. "Flagship of the year" = the card with the maximum `fp32_performance` released that year, tracked separately for NVIDIA and AMD.
- For the TDP/efficiency curves I excluded dual-GPU cards (GTX 295, HD 6990, R9 295X2, etc.); otherwise TDP and FLOPS double up and break the trend.
- Where the data is noisy:
  - `vendor` is filled in for ~2,360 of 13,566 cards (the rest are mostly OEM partner-board variants). Medians use the labeled subset; flagship peaks are fully labeled.
  - FP16/tensor performance is not directly comparable between vendors because of structured sparsity. Starting with Ampere (A100), NVIDIA quotes tensor FP16/BF16 in its spec sheets with sparsity already applied, which is 2× the dense value (the feature processes sparse matrices twice as fast). Our database stores exactly this "sparse" figure for such cards. AMD has no equivalent spec line, so its figures are dense. NVIDIA's raw FP16 column (A100+) therefore has to be halved to compare fairly with AMD: A100 = 624 (sparse) → 312 dense, H100 = 1979 → ~990 dense. The "AI inflection" part below relies on these dense-normalized numbers.
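The dense normalization described above can be sketched like this. Treat it as an illustration, not the exact code behind the article: the function name, its parameters, and the set of architectures that get halved are my assumptions (the text only says "A100+"), so adjust the set to whatever your copy of the data actually stores.

```python
# ASSUMPTION: architectures whose stored tensor FP16 figure includes 2x sparsity.
# The article only states "A100 and later"; this explicit set is illustrative.
SPARSE_QUOTED_ARCHS = {"Ampere", "Hopper", "Ada Lovelace", "Blackwell"}


def dense_fp16_tflops(vendor: str, architecture: str, fp16_tflops: float) -> float:
    """Return a dense-equivalent FP16 figure so vendors can be compared."""
    if vendor == "NVIDIA" and architecture in SPARSE_QUOTED_ARCHS:
        return fp16_tflops / 2  # undo the 2x structured-sparsity multiplier
    return fp16_tflops          # AMD and older NVIDIA parts are already dense


a100 = dense_fp16_tflops("NVIDIA", "Ampere", 624.0)
h100 = dense_fp16_tflops("NVIDIA", "Hopper", 1979.0)
amd = dense_fp16_tflops("AMD", "CDNA 3", 100.0)  # placeholder value, not a real spec
print(a100, h100, amd)
```

With the two figures quoted in the text, this gives 312 for the A100 and 989.5 (the "~990") for the H100, while the AMD placeholder passes through unchanged.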
1. FLOPS: an almost perfectly straight exponential
Peak FP32 of the single flagship by year (NVIDIA):
| Year | Flagship | FP32, TFLOPS |
|---|---|---|
| 2006 | GeForce 8800 GTX | 0.3 |
| 2010 | GeForce GTX 580 | 1.6 |
| 2013 | GeForce GTX 780 Ti | 5.3 |
| 2016 | Quadro P6000 | 12.6 |
| 2017 | Tesla V100 | 15.7 |
| 2020 | RTX A6000 | 38.7 |
| 2022 | L40S | 91.6 |
| 2025 | RTX PRO 6000 Blackwell | 126.0 |
≈400× in 19 years is a CAGR of about 37% per year. On a semi-log scale the line is almost straight: a classic exponential that has only recently started to bend in the "desktop" segment, with the growth moving into the datacenter.
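Both claims (the growth rate and the "almost straight line") can be checked from the eight table rows alone. The sketch below uses only the standard library; the variable names are mine, and the least-squares fit on log10(TFLOPS) is just one reasonable way to quantify "straight on a semi-log scale."

```python
import math

# (year, peak FP32 TFLOPS) copied from the NVIDIA flagship table above.
points = [(2006, 0.3), (2010, 1.6), (2013, 5.3), (2016, 12.6),
          (2017, 15.7), (2020, 38.7), (2022, 91.6), (2025, 126.0)]

# Endpoint CAGR: total growth factor spread evenly over the elapsed years.
(y0, f0), (y1, f1) = points[0], points[-1]
cagr = (f1 / f0) ** (1 / (y1 - y0)) - 1
doubling_years = math.log(2) / math.log(1 + cagr)

# Least-squares line through (year, log10 TFLOPS). A straight line on a
# semi-log plot corresponds to a constant annual growth rate.
xs = [year for year, _ in points]
ys = [math.log10(tflops) for _, tflops in points]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
sxx = sum((x - mx) ** 2 for x in xs)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
syy = sum((y - my) ** 2 for y in ys)
slope = sxy / sxx                   # log10 units per year
fit_growth = 10 ** slope - 1        # annual growth implied by the fit
r_squared = sxy ** 2 / (sxx * syy)  # 1.0 would be a perfectly straight line

print(f"endpoint CAGR {cagr:.1%}, doubling every {doubling_years:.1f} years")
print(f"fitted growth {fit_growth:.1%} per year, R^2 = {r_squared:.3f}")
```

On these points the fitted rate lands close to the endpoint CAGR, and R² comes out above 0.97, which is what "almost straight" means in numbers. Eight hand-picked rows are a thin sample, so the full dataset is the better place to repeat this.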
2. TDP: a quiet climb, then a datacenter explosion
| Year | Card | TDP, W |
|---|---|---|
| 2006 | GeForce 8800 GTX | 155 |
| 2010 | GTX 580 | 244 |
| 2017 | Tesla V100 | 250 |
| 2020 | RTX A6000 | 300 |
| 2022 | H100 SXM | 700 |
| 2024 | MI325X / B200 | 1000 |
| 2025 | MI355X | 1400 |
For a decade and a half the flagship TDP stayed in a 150–300 W band. The break comes after 2020, and it's entirely datacenter-driven: AI accelerators (SXM/OAM modules) shot up to 700–1400 W because they're cooled by liquid in a rack, not by a fan in a case. The desktop ceiling separately hit ~450–600 W (RTX 4090/5090).
There's a curious gap if you look at NVIDIA's consumer flagships separately: the GeForce flagship sat at exactly 250 W for seven years (2013–2019), through the GTX 780 Ti, Titan X, 1080 Ti, and 2080 Ti, and only broke that ceiling with the RTX 3090 (350 W, 2020), then the 4090 (450 W) and 5090 (575 W). Datacenter accelerators, by contrast, went to 700–1400 W almost immediately. It looks like what capped gaming TDP wasn't the silicon so much as the market: cases, PSUs, and buyer habits. In a rack there are no such limits, and watts grew without looking back. (This is interpretation: the spec stores watts, not intentions. But a 250 W plateau lasting seven years shows up clearly in the data.)
3. Performance per watt: this is where the progress is
If you only look at TDP, it feels like "everything's getting worse, cards guzzle power." But FP32 per watt tells the opposite story:
| Year | Flagship | TFLOPS/W |
|---|---|---|
| 2006 | 8800 GTX | 0.002 |
| 2013 | GTX 780 Ti | 0.021 |
| 2016 | Quadro P6000 | 0.051 |
| 2020 | RTX A6000 | 0.129 |
| 2022 | L40S | 0.262 |
| 2025 | RTX PRO 6000 Blackwell | 0.210 |
~100× in efficiency. Peak "classic" efficiency lands in 2022 (Ada/L40S); the 2024–25 datacenter cards sometimes lose on TFLOPS/W because they deliberately trade efficiency for absolute compute density in the rack. The main drivers of efficiency gains are the process node (90 nm → 3 nm) and architectural improvements, not clocks.
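The TFLOPS/W column is nothing more than peak FP32 divided by TDP. Here is a small sketch using only the four NVIDIA cards that appear in both the FLOPS table and the TDP table earlier in the article; the variable names are mine.

```python
# (card, peak FP32 TFLOPS, TDP in watts), taken from the two earlier tables.
cards = [
    ("GeForce 8800 GTX", 0.3, 155),
    ("GeForce GTX 580", 1.6, 244),
    ("Tesla V100", 15.7, 250),
    ("RTX A6000", 38.7, 300),
]

# Efficiency is theoretical peak throughput per watt of rated board power.
efficiency = {name: tflops / tdp for name, tflops, tdp in cards}
for name, value in efficiency.items():
    print(f"{name:18s} {value:.3f} TFLOPS/W")

# Gain between the first and last card in this subset (2006 to 2020).
gain = efficiency["RTX A6000"] / efficiency["GeForce 8800 GTX"]
print(f"8800 GTX -> A6000: {gain:.0f}x")
```

The 8800 GTX and A6000 values round to the 0.002 and 0.129 shown in the table. Note that both inputs are spec-sheet figures, so this is rated efficiency, not measured efficiency.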
4. The NVIDIA vs AMD duel
If you mark, year by year, whose single flagship had the higher FP32:
| Period | Leader | Context |
|---|---|---|
| 2007–2008 | AMD | FireStream 9170/9270 |
| 2010–2013 | AMD | TeraScale → GCN: HD 6970, HD 7970 GHz, R9 290X |
| 2014 | NVIDIA | Titan Black (5.6) vs FirePro W9100 (5.2) |
| 2015 | AMD | Fury X (8.6) |
| 2016–2020 | NVIDIA | Pascal → Ampere, the AI pivot |
| 2021 | AMD | Instinct MI250X (47.9) |
| 2022 | NVIDIA | L40S / Hopper |
| 2023–2024 | AMD | Instinct MI300A/MI325X (81.7) |
| 2025 | NVIDIA | Blackwell (126) |
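For anyone reproducing this table from the CSV, the "flagship of the year" rule boils down to a per-year, per-vendor maximum. The sketch below runs on a tiny inline sample. Only `vendor` and `fp32_performance` are column names taken from the article; `name` and `release_year` are hypothetical, and the two `placeholder-` rows are invented to exercise the logic.

```python
import csv
import io
from collections import defaultdict

# Inline stand-in for the specs CSV. The 2014 pair comes from the table above;
# the placeholder rows are made up, and the column layout is an assumption.
SAMPLE = """name,vendor,release_year,fp32_performance
Titan Black,NVIDIA,2014,5.6
FirePro W9100,AMD,2014,5.2
placeholder-midrange,NVIDIA,2014,2.0
placeholder-oem-board,,2014,9.9
"""


def flagship_leaders(rows):
    """Return {year: (vendor, card, tflops)} for the vendor with the top card."""
    best = defaultdict(dict)  # year -> vendor -> (tflops, name)
    for row in rows:
        vendor = row["vendor"]
        if vendor not in ("NVIDIA", "AMD"):
            continue  # skip unlabeled or other-vendor rows
        year = int(row["release_year"])
        entry = (float(row["fp32_performance"]), row["name"])
        if vendor not in best[year] or entry > best[year][vendor]:
            best[year][vendor] = entry
    leaders = {}
    for year, per_vendor in sorted(best.items()):
        vendor, (tflops, name) = max(per_vendor.items(), key=lambda kv: kv[1][0])
        leaders[year] = (vendor, name, tflops)
    return leaders


leaders = flagship_leaders(csv.DictReader(io.StringIO(SAMPLE)))
print(leaders)
```

Skipping rows without a vendor label matters here: in the sample, the unlabeled board has the highest number but cannot be attributed to either side, so it is ignored.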
The picture is wavy, and I included it mostly for the intrigue, to give AMD at least a fighting chance: on raw FP32, AMD took the lead regularly, in the GCN era and again on recent Instinct parts. But raw FP32 is exactly the deceptive metric for today's world. The AI era is won not on FP32, but on software and FP16/BF16/FP8. Here NVIDIA, with tensor cores (since V100, 2017) and the CUDA ecosystem, built a moat that the FP32 numbers alone don't reveal: V100 delivered ~125 TFLOPS tensor-FP16, A100 ~312, H100 ~990 (vendor public data). In other words, the "FP32 duel" is about the past, the GPU as a graphics accelerator; the real battle has moved to a plane FP32 doesn't measure.
So here's one more chart: the FP16 duel, where NVIDIA is consistently ahead. And once you layer the AI software stack on top of that…
5. What else the data shows
- Process node: 90 nm (2006) → 28 nm (a 2012–2015 plateau, the "stuck node") → 16/12/7 → 3 nm (MI355X, 2025).
- Flagship VRAM: 0.77 GB (8800 GTX) → 12–24 GB (mid-2010s) → 48 GB (A6000) → 192–288 GB (MI300/MI355X). Since 2020 memory has grown faster than peak FP32 (48 → 288 GB is 6×, versus about 3.3× for 38.7 → 126 TFLOPS), because AI models are bottlenecked on it.
- The "stuck" 28 nm: for four years (2012–2015) the industry sat on one node, and that's exactly when AMD held parity/leadership on FP32. As soon as the process-node sprint resumed and tensor cores appeared, the advantage swung to NVIDIA.
Open dataset: take it
We've published a cleaned dump of our GPU spec database for anyone who wants to dig in themselves:
Download: gpuark.com/datasets. The files are gpuark-gpu-specs.csv, gpuark-benchmarks.csv, gpuark-gpu-dataset.sqlite, or everything in a single gpuark-gpu-dataset.tar.gz archive.
- 13,566 GPUs (fields: vendor, manufacturer, release date, architecture, process node, transistors, clocks, memory size and type, bus, FP16/FP32/FP64/BF16/TF32/INT8, TDP, NVLink, CUDA SM, and more) + 993 third-party benchmark results (join on `gpu_id`).
- Formats: CSV (Excel/pandas) and SQLite (ready-made SQL), with two tables, `gpu_specs` and `benchmarks`.
- License: CC BY 4.0 (attribution to gpuark.com).
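To give a feel for the two-table layout, here is a sketch of the `gpu_id` join using Python's built-in `sqlite3`. It builds a throwaway in-memory database so it runs without the download. The table names and the `gpu_id`, `vendor`, and `fp32_performance` columns come from the article; `name`, `benchmark`, and `score` are hypothetical column names, and every row is a placeholder, so check the real schema before adapting the query.

```python
import sqlite3

# In-memory stand-in; after downloading, connect to the .sqlite file instead.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gpu_specs (
        gpu_id INTEGER PRIMARY KEY,
        name TEXT,               -- hypothetical column name
        vendor TEXT,
        fp32_performance REAL
    );
    CREATE TABLE benchmarks (
        gpu_id INTEGER,
        benchmark TEXT,          -- hypothetical column name
        score REAL               -- hypothetical column name
    );
    INSERT INTO gpu_specs VALUES
        (1, 'placeholder-card-a', 'NVIDIA', 10.0),
        (2, 'placeholder-card-b', 'AMD', 12.0);
    INSERT INTO benchmarks VALUES
        (1, 'placeholder-bench', 100.0),
        (2, 'placeholder-bench', 90.0);
""")

# Pair each benchmark result with the theoretical peak of the same card.
rows = conn.execute("""
    SELECT s.name, s.vendor, s.fp32_performance, b.benchmark, b.score
    FROM gpu_specs AS s
    JOIN benchmarks AS b ON b.gpu_id = s.gpu_id
    ORDER BY b.score DESC
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

A join like this is the natural starting point for comparing theoretical peak against measured results, which is the gap the methodology section warns about.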
If you'd rather explore interactively before downloading, the same data powers the GPU comparison tool on the site.
Takeaways
- FLOPS grew as an almost perfect exponential (~37%/yr), but the "free" growth is over; from here we pay with TDP and a move into the rack.
- Real progress is measured not in watts and not in raw FP32, but in performance per watt (×100), and that rides on the process node.
- AMD fought and led on the "raw" numbers more often than people think; but the AI era was defined by tensor + software, not FP32.
The data is open β if you find something in it we missed, let me know.

















