Curated developer articles, tutorials, and guides — auto-updated hourly


Benchmarking HiDream-O1-Image skeleton mode across 8 patterns reveals 3 counterintuitive findings ab...


On InfiniBand the data path never touches TCP, so the retransmit proxy reads zero. The measured...


In the previous post, I used a C++ CUDA example to look at memory coalescing and how memory access...


How to go from 86 GiB idle VRAM (instant OOM) to 0 GiB idle / 40 GiB peak by using a cold-start desi...


TL;DR: After del tensor; torch.cuda.empty_cache(), PyTorch's caching allocator still...


A data-driven look at 19 years of GPUs: ~400x FP32 growth, the datacenter TDP explosion, ~100x perf/...


Your GPU sits at 15% utilization and bigger batches don't help? Here's how to diagnose whether you'r...


How fp8_cast reduced LTX-2 22B peak VRAM from 40 GiB to 24 GiB in cold-start mode, and why optimum-q...


CUDA 13.3 Lands, AI Writes Blackwell Kernels, & FP4 VRAM Optimization for LLMs ...


Real-world timing benchmarks for HiDream-O1-Image Full — tuning steps, guidance scale, and resolutio...


AMD GPU/AI Launches, Legacy Driver Update & CUDA Optimization Platform ...


GPU waste in Kubernetes does not announce itself. Your cluster shows healthy utilization. Your...


Full pipeline: Gemma 4 31B expands a one-liner into a 10-beat script, HiDream generates images, LTX-...


Not a GPU unboxing. A real look at what 96GB VRAM enables for multi-model agent pipelines — and wher...


PatentLLM: CUDA TileLang/Triton B200 5x Speedup, RTX 5090 Power, PTX Grammar ...


RTX 5080 Undervolt Benchmarks, CGO-Free CUDA API Binding, & AMD GPU Compatibility...


FlashAttention CUDA Kernel, Strix Halo MOE Boost, & NVIDIA DLSS 4.5 Driver Update ...


Compare 5090 vs 4090 by VRAM, bandwidth, power, and real AI workflow fit, then decide whether to buy...


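This GTC researcher interview video is hosted by Kimbo Chen of SemiAnalysis, in conversation with a Cornell University assistant professor who is co-founder and Chief Science Officer of Makora (formerly Mako), Mohamed...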


RTX 5090 Cooling, BeeLlama VRAM Opts, Resizable BAR Performance Gains Today's...


Intel Arc & Arm Mali: New GPUs, Drivers & Benchmarks for Linux Today's...


Decide whether a GPT endpoint belongs on Serverless GPU, a GPU Pod, or a VM by comparing traffic sha...


Build cost-effective serverless endpoints for Docker-based model inference by reducing idle GPU time...

RTX 4090 24GB runs LTX-Video faster than real-time. 5 GPUs ranked for image-to-video and text-to-vid...