Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study

Athos Georgiou

TL;DR

A cross-architecture evaluation of production LLM inference on AMD Instinct MI325X GPUs demonstrates that architecture-aware optimization is essential: MLA models require block size 1 and cannot use KV cache offloading, while GQA models benefit from both.

Abstract

We present a cross-architecture evaluation of production LLM inference on AMD Instinct MI325X GPUs, benchmarking four models spanning 235B to 1 trillion parameters across three architectural families (MoE+MLA, Dense+GQA, MoE+GQA) on an 8-GPU cluster with 2TB aggregate HBM3e using vLLM v0.14.1. Our results demonstrate that architecture-aware optimization is essential: MLA models require block size 1 and cannot use KV cache offloading, while GQA models benefit from both. The AMD AITER runtime is required for competitive MLA inference throughput and must be selectively disabled for architectures with incompatible attention head configurations. A controlled AITER ablation on Llama-3.1-405B (n=5 per condition) reveals a modest 3-5% throughput benefit at high concurrency but 2-16x higher measurement variability, confirming that AITER's large speedups target MoE/MLA kernels specifically. Under text-only workloads, Llama-405B and DeepSeek V3.2 achieve comparable peak throughput (15,944 and 15,343 tok/s) despite an order-of-magnitude difference in active parameters. Under vision workloads, Qwen3-VL-235B reaches 47,873 tok/s, 6.5x higher than Kimi-K2.5 (7,327 tok/s). Active parameter count per token is associated with inference throughput, though confounded by differences in quantization, AITER acceleration, and tensor parallelism. All four models exhibit a common throughput saturation point consistent with a memory-bandwidth bottleneck (~500 concurrent for short sequences, ~100-200 for longer sequences). All models maintain 100% HTTP-level success rates through 1,000 concurrent users, processing 18.9 million tokens across 17,406 requests without failures.
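The headline comparisons in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the function and variable names are illustrative and do not come from the paper. The mean tokens per request is a derived quantity that the abstract does not state, and it assumes the 18.9 million tokens and 17,406 requests refer to the same set of runs.

```python
# Sanity checks on figures quoted in the abstract (names are illustrative).

def ratio(a: float, b: float) -> float:
    """Return how many times larger a is than b."""
    return a / b

def relative_gap(a: float, b: float) -> float:
    """Return the gap between a and b as a fraction of b."""
    return (a - b) / b

# Peak throughput values in tok/s, as quoted in the abstract.
QWEN3_VL_TOKS = 47_873
KIMI_K25_TOKS = 7_327
LLAMA_405B_TOKS = 15_944
DEEPSEEK_V32_TOKS = 15_343

# Aggregate totals quoted in the abstract.
TOTAL_TOKENS = 18.9e6
TOTAL_REQUESTS = 17_406

vision_ratio = ratio(QWEN3_VL_TOKS, KIMI_K25_TOKS)
text_gap = relative_gap(LLAMA_405B_TOKS, DEEPSEEK_V32_TOKS)
# Derived here, not reported in the abstract.
mean_tokens_per_request = TOTAL_TOKENS / TOTAL_REQUESTS

if __name__ == "__main__":
    print(f"Qwen3-VL vs Kimi-K2.5: {vision_ratio:.2f}x")
    print(f"Llama-405B vs DeepSeek V3.2 gap: {text_gap:.1%}")
    print(f"Mean tokens per request: {mean_tokens_per_request:.0f}")
```

The vision ratio rounds to the 6.5x stated in the abstract, and the two text-only peaks differ by under 4%, which is what "comparable" refers to.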

Paper Structure (125 sections, 8 figures, 30 tables)

Figures (8)

  • Figure 1: Throughput normalized by active parameter count (tok/s per billion active parameters), grouped by workload type. Left pair: text-only workload; right pair: vision workload. Values are not directly comparable across workload types because vision total tok/s includes image tokens. Y-axis uses logarithmic scale. Data from primary stress-test benchmark ($3\times$ multiplier).
  • Figure 2: Text-only workload
  • Figure 3: Vision workload
  • Figure 5: p99 latency as a function of concurrent requests. Text models (Llama-3.1-405B, DeepSeek V3.2) use a 500-token input / 100-token output workload; vision models (Qwen3-VL, Kimi-K2.5) use a 100-token input + 1 image / 200-token output workload. Latency values are not directly comparable across workload types. All models show sublinear latency growth: throughput increases faster than latency, yielding positive scaling efficiency at all tested concurrency levels. Inset shows the 0--15s range for Qwen3-VL, Llama-3.1-405B, and DeepSeek V3.2 (Kimi-K2.5 latencies of 25--103s compress these curves in the main plot). Data from primary stress-test benchmark ($3\times$ multiplier).
  • Figure 6: Throughput vs. concurrency with 95% confidence interval error bars from $n{=}5$ independent runs per model. DeepSeek V3.2 exhibits the widest error bars (CoV up to 11.7% at peak concurrency, reaching 50.8% at concurrency 10), while Qwen3-VL and Kimi-K2.5 show near-deterministic behavior. Data from multi-run reproducibility workload (100 requests, 2,048 input / 512 output tokens); not directly comparable to the primary stress-test benchmark.
  • ...and 3 more figures
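The Figure 6 caption reports a coefficient of variation (CoV) and 95% confidence intervals from n=5 runs. As a minimal sketch of how such statistics are commonly computed, the code below assumes the sample standard deviation and a two-sided Student-t interval with 4 degrees of freedom; the paper's exact procedure is not given in this excerpt, and the throughput samples are hypothetical placeholders, not measured data.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (n = 5 runs).
T_CRIT_DF4 = 2.776

def cov_percent(samples: list[float]) -> float:
    """Coefficient of variation: sample standard deviation over mean, in percent."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)

def ci95_half_width_n5(samples: list[float]) -> float:
    """Half-width of a t-based 95% confidence interval for the mean of 5 runs."""
    if len(samples) != 5:
        raise ValueError("this sketch hard-codes the critical value for n = 5")
    return T_CRIT_DF4 * statistics.stdev(samples) / math.sqrt(len(samples))

# Hypothetical tok/s values for five independent runs (NOT from the paper).
hypothetical_runs = [1000.0, 1100.0, 900.0, 1050.0, 950.0]

run_cov = cov_percent(hypothetical_runs)
run_half_width = ci95_half_width_n5(hypothetical_runs)

if __name__ == "__main__":
    mean = statistics.mean(hypothetical_runs)
    print(f"mean = {mean:.1f} tok/s, CoV = {run_cov:.2f}%")
    print(f"95% CI = {mean:.1f} +/- {run_half_width:.1f} tok/s")
```

With only five runs the t critical value (2.776) is well above the normal-approximation value of 1.96, so error bars computed this way are wider than a z-based interval would suggest.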