SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining

Yifan Zhang, Zunhai Su, Shuhao Hu, Rui Yang, Wei Wu, Yulei Qian, Yuchen Xie, Xunliang Cai

TL;DR

SnapMLA is a hardware-aware FP8 framework for the decoding phase of MLA-based LLMs. It addresses three obstacles: numerical heterogeneity in the KV cache, quantization-scale misalignment in the FP8 PV GEMM, and system-level dataflow bottlenecks. Its three components are RoPE-Aware Per-Token KV Quantization, Quantized PV Computation Pipeline Reconstruction, and End-to-End Dataflow Optimization. Together they rely on per-token quantization, pre-scaled domain alignment, scale fusion, and fused kernels to improve Tensor Core utilization on Hopper GPUs, yielding up to a 1.91× throughput improvement with negligible degradation on challenging long-context benchmarks such as mathematical reasoning and code generation. Overall, SnapMLA enables efficient, scalable long-context MLA decoding, with practical implications for LLM serving.

Abstract

While FP8 attention has shown substantial promise in innovations like FlashAttention-3, its integration into the decoding phase of the DeepSeek Multi-head Latent Attention (MLA) architecture presents notable challenges. These challenges include numerical heterogeneity arising from the decoupling of positional embeddings, misalignment of quantization scales in FP8 PV GEMM, and the need for optimized system-level support. In this paper, we introduce SnapMLA, an FP8 MLA decoding framework optimized to improve long-context efficiency through the following hardware-aware algorithm-kernel co-optimization techniques: (i) RoPE-Aware Per-Token KV Quantization, where the RoPE part is maintained in high precision, motivated by our comprehensive analysis of the heterogeneous quantization sensitivity inherent to the MLA KV cache. Furthermore, per-token granularity is employed to align with the autoregressive decoding process and maintain quantization accuracy. (ii) Quantized PV Computation Pipeline Reconstruction, which resolves the misalignment of quantization scale in FP8 PV computation stemming from the shared KV structure of the MLA KV cache. (iii) End-to-End Dataflow Optimization, where we establish an efficient data read-and-write workflow using specialized kernels, ensuring efficient data flow and performance gains. Extensive experiments on state-of-the-art MLA LLMs show that SnapMLA achieves up to a 1.91x improvement in throughput, with negligible risk of performance degradation in challenging long-context tasks, including mathematical reasoning and code generation benchmarks. Code is available at https://github.com/meituan-longcat/SGLang-FluentLLM.
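
As a rough illustration of technique (i), the sketch below quantizes each token's latent content with its own scale and stores the RoPE part untouched. It is a minimal sketch under stated assumptions, not the SnapMLA kernel: `fake_fp8`, `quantize_token`, and `dequantize_content` are hypothetical helpers, the FP8 cast is only emulated in software (subnormals and the minimum exponent are ignored), and the toy values are invented for the example.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 format


def fake_fp8(x):
    """Crude software stand-in for an FP8 E4M3 cast (assumption, not a real cast).

    Clamps to the E4M3 range and keeps 4 significant bits (1 implicit + 3
    explicit mantissa bits). Subnormals and the minimum exponent are ignored.
    """
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    m, e = math.frexp(x)  # x == m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 16) / 16, e)


def quantize_token(content, rope):
    """Quantize one token's content part with its own scale; keep RoPE as-is."""
    amax = max(abs(v) for v in content)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return {
        "content_q": [fake_fp8(v / scale) for v in content],
        "scale": scale,        # one scale per token
        "rope": list(rope),    # stays in high precision
    }


def dequantize_content(entry):
    """Recover an approximation of the original content values."""
    return [q * entry["scale"] for q in entry["content_q"]]


# Decoding appends one token at a time; earlier entries and their scales
# are never revisited, which is why per-token granularity fits decoding.
cache = []
toy_tokens = [
    ([0.02, -0.5, 0.13, 0.9], [35.0, -120.0]),
    ([3.0, -7.5, 0.4, 1.1], [80.0, 15.0]),
]
for content, rope in toy_tokens:
    cache.append(quantize_token(content, rope))
```

Because the scale is computed from the new token alone, appending a token never requires requantizing anything already in the cache.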

Paper Structure (48 sections, 9 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: End-to-end decoding throughput comparison. We evaluate the generation throughput of SnapMLA on DeepSeek-R1 and LongCat-Flash-Thinking across various parallelization configurations and context lengths. For additional details, please refer to Section \ref{sec:system_performance}.
  • Figure 2: Overview of the scale fusion pipeline in SnapMLA. Note that the $K$ and $V$ content share the latent cache ($\mathbf{c}_{KV}$). Key Step 1 illustrates RoPE-Aware Per-Token KV Quantization, where the BF16 RoPE part of the QK computation is pre-scaled with the quantization scale of the content part, thereby unifying the numerical domains of the quantized content part and the unquantized RoPE part. Key Step 2 demonstrates Quantized PV Computation Pipeline Reconstruction, where the scale of $V$ is pre-fused into $P$, circumventing the quantization dimension mismatch.
  • Figure 3: Analysis of the numerical value distribution and quantization error comparison for the content and RoPE components of MLA KV cache in LongCat-Flash-Thinking.
  • Figure 4: Layer-wise numerical fidelity analysis (context length = 32k). For details on the quantization configurations, please refer to Table \ref{tab:Numerical Accuracy}.
  • Figure 5: Kernel-level compute performance (TFLOPS). We measure the compute throughput of SnapMLA (blue hatched) versus the FlashMLA baseline (gray) across varying sequence lengths. The workload configurations are derived from the corresponding end-to-end DP/TP settings. Our kernel closely tracks the trajectory of the effective theoretical peak.
  • ...and 2 more figures
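
The two key steps in the Figure 2 caption can be checked numerically on a toy cache. This is a hedged sketch, not the paper's kernel. It assumes that "pre-scaling" the RoPE part means dividing it by the token's content scale, so that both parts share one accumulation and a single rescale. It keeps $P$ in full precision and omits the softmax temperature and any quantization of the query. All function names and values are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


# Toy latent cache: quantized content, one scale per token, RoPE part.
cache = [
    {"content_q": [10.0, -256.0, 64.0, 448.0], "scale": 0.002, "rope": [0.5, -1.0]},
    {"content_q": [176.0, -448.0, 24.0, 64.0], "scale": 0.0167, "rope": [0.8, 0.2]},
    {"content_q": [-448.0, 30.0, 120.0, 8.0], "scale": 0.0005, "rope": [-0.3, 0.9]},
]
q_content = [0.3, -0.1, 0.7, 0.2]
q_rope = [1.0, -0.5]


def scores_reference(cache):
    """Dequantize the content part first, then add the RoPE dot product."""
    return [
        dot(q_content, [c * t["scale"] for c in t["content_q"]]) + dot(q_rope, t["rope"])
        for t in cache
    ]


def scores_aligned(cache):
    """Key Step 1 (assumed form): move RoPE into the quantized domain,
    accumulate both parts together, then apply the token scale once."""
    out = []
    for t in cache:
        rope_scaled = [r / t["scale"] for r in t["rope"]]
        acc = dot(q_content, t["content_q"]) + dot(q_rope, rope_scaled)
        out.append(acc * t["scale"])
    return out


def pv_reference(p, cache):
    """Dequantize V per token inside the reduction over tokens."""
    dim = len(cache[0]["content_q"])
    return [
        sum(p[j] * cache[j]["content_q"][d] * cache[j]["scale"] for j in range(len(cache)))
        for d in range(dim)
    ]


def pv_fused(p, cache):
    """Key Step 2: fold each token's V scale into P, then multiply by raw V."""
    dim = len(cache[0]["content_q"])
    p_scaled = [p[j] * cache[j]["scale"] for j in range(len(cache))]
    return [
        sum(p_scaled[j] * cache[j]["content_q"][d] for j in range(len(cache)))
        for d in range(dim)
    ]


p = softmax(scores_aligned(cache))
out_ref = pv_reference(p, cache)
out_fused = pv_fused(p, cache)
```

The per-token scale varies along the token axis, which is the reduction axis of the PV product, so it cannot be factored out after the multiplication. Folding it into $P$ beforehand gives the same result while leaving the quantized $V$ untouched.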