TyphoonMLA: A Mixed Naive-Absorb MLA Kernel For Shared Prefix

Ahmet Caner Yüzügüler, Ahmet Çelik, Jiawei Zhuang, Lukas Cavigelli

TL;DR

TyphoonMLA targets the decode-time inefficiency of Multi-Head Latent Attention (MLA) by leveraging data reuse from shared KV-cache prefixes. It fuses naive (compute-efficient in shared regions) and absorb (memory-efficient in non-shared regions) formulations into a single hybrid kernel, with prefill and decode stages that maintain equivalence to standard MLA while reducing MACs and HBM traffic. A key result is the derivation of a batch-size threshold $B_\theta$ that governs when to favor naive versus absorb components, enabling consistent speedups; experiments on NPUs and GPUs show up to 3.2× speedups with a ~3% memory overhead and no accuracy loss due to the method’s mathematical equivalence. This approach enables more efficient MLA inference and is compatible with existing optimization and distribution strategies, supporting deployment at scale without retraining. $B_\theta = \frac{(D_{qk}+D_v)}{S_q(2D_l+D_r)} \frac{T}{M}$ illustrates the trade-off between compute and memory in shared-prefix scenarios.
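To make the threshold concrete, the sketch below simply evaluates the formula as written. The summary does not define the symbols, so the reading used here is an assumption: $D_{qk}$ and $D_v$ as per-head query/key and value dimensions, $D_l$ as the latent (compressed KV) dimension, $D_r$ as the RoPE dimension, $S_q$ as the query length per request, $T$ as peak compute throughput, and $M$ as HBM bandwidth. The hardware numbers are hypothetical and chosen only for illustration, and the dispatch helper assumes that batch sizes above the threshold favor the naive path on the shared prefix.

```python
def batch_size_threshold(d_qk, d_v, d_l, d_r, s_q, compute_tput, mem_bw):
    """Evaluate B_theta = (D_qk + D_v) / (S_q * (2*D_l + D_r)) * (T / M).

    Symbol meanings are assumed (see the lead-in); the function only
    transcribes the formula from the TL;DR.
    """
    return (d_qk + d_v) / (s_q * (2 * d_l + d_r)) * (compute_tput / mem_bw)


def use_naive_on_shared_prefix(batch_size, threshold):
    # Assumption: larger batches reuse the shared prefix more, which
    # favors the naive formulation for that part of the KV-cache.
    return batch_size > threshold


# Head dimensions in the style of DeepSeek-v3 (assumed mapping to the symbols).
# compute_tput and mem_bw are hypothetical device figures, not measurements.
b_theta = batch_size_threshold(
    d_qk=192, d_v=128, d_l=512, d_r=64, s_q=1,
    compute_tput=400e12,  # operations per second (hypothetical)
    mem_bw=1.6e12,        # bytes per second (hypothetical)
)

print(f"B_theta = {b_theta:.2f}")
print("batch 128 -> naive on shared prefix:", use_naive_on_shared_prefix(128, b_theta))
print("batch 16  -> naive on shared prefix:", use_naive_on_shared_prefix(16, b_theta))
```

Under this reading, the threshold grows with the device's compute-to-bandwidth ratio $T/M$ and shrinks as the query length $S_q$ increases.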

Abstract

Multi-Head Latent Attention (MLA) is a recent attention mechanism adopted in state-of-the-art LLMs such as DeepSeek-v3 and Kimi K2. Thanks to its novel formulation, MLA allows two functionally equivalent but computationally distinct kernel implementations: naive and absorb. While naive kernels (e.g., FlashAttention) are typically preferred in training and prefill for their computational efficiency, existing decoding kernels (e.g., FlashMLA) rely on the absorb method to minimize HBM bandwidth usage. However, because absorb implementations are compute-bound, they cannot benefit from data reuse opportunities in attention calculations, such as shared prefixes. In this work, we introduce TyphoonMLA, a hybrid approach that combines naive and absorb formulations to harness the strengths of both. TyphoonMLA effectively leverages the shared prefix by applying the naive formulation to the compute-bound parts of attention calculations, while reducing the bandwidth requirements for non-shared parts by using the absorb formulation. As a result, TyphoonMLA improves the throughput of attention calculations in MLA architectures by up to 3x on NPUs and 3.24x on GPUs, with only a 3% overhead in HBM size.
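The equivalence of the two formulations can be checked numerically with a minimal single-head sketch. This is an illustration under simplifying assumptions, not the authors' kernel: RoPE components and the softmax scaling factor are omitted, and the names `w_uk` and `w_uv` (up-projections from the latent cache to keys and values) are hypothetical. The naive path expands the latent cache into full keys and values; the absorb path folds `w_uk` into the query and applies `w_uv` after attending over the latent cache directly.

```python
import math
import random


def matvec(m, v):
    # (r x c) matrix times length-c vector -> length-r vector
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def vecmat(v, m):
    # length-r vector times (r x c) matrix -> length-c vector
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def matmul(a, b):
    return [vecmat(row, b) for row in a]


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def naive_attention(q, cache, w_uk, w_uv):
    keys = matmul(cache, w_uk)       # N x D_qk, materialized per token
    values = matmul(cache, w_uv)     # N x D_v, materialized per token
    probs = softmax(matvec(keys, q))
    return vecmat(probs, values)


def absorb_attention(q, cache, w_uk, w_uv):
    q_latent = matvec(w_uk, q)       # fold w_uk into the query -> D_l
    probs = softmax(matvec(cache, q_latent))
    o_latent = vecmat(probs, cache)  # attend over the latent cache -> D_l
    return vecmat(o_latent, w_uv)    # apply w_uv once at the end -> D_v


random.seed(0)
n_tokens, d_l, d_qk, d_v = 6, 4, 3, 5  # tiny illustrative sizes


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


cache = rand_matrix(n_tokens, d_l)   # latent KV-cache
w_uk = rand_matrix(d_l, d_qk)
w_uv = rand_matrix(d_l, d_v)
q = [random.uniform(-1, 1) for _ in range(d_qk)]

out_naive = naive_attention(q, cache, w_uk, w_uv)
out_absorb = absorb_attention(q, cache, w_uk, w_uv)
max_diff = max(abs(a - b) for a, b in zip(out_naive, out_absorb))
print("max abs difference:", max_diff)
```

Both paths produce the same output up to floating-point rounding. They differ in cost: the absorb path reads only the latent cache, while the naive path works on expanded keys and values, which is the compute versus bandwidth trade-off the abstract describes.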

Paper Structure

This paper contains 14 sections, 1 equation, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: a) The naive formulation of MLA. b) The absorb formulation of MLA. c) The prefill and decode stages of TyphoonMLA.
  • Figure 2: Benchmark results on Ascend NPUs. Y-axes represent normalized throughput in terms of the number of generated tokens per second per layer. Some data points for baselines are missing as their memory footprint exceeds the HBM capacity.
  • Figure 3: Benchmark results on GPU for various batch sizes. Y-axes represent throughput in terms of the number of generated tokens per second per layer.
  • Figure 4: Latency breakdown of TyphoonMLA (bars on the left-hand side) and CATLASS absorb-only baseline (bars on the right-hand side) for Kimi K2 architecture. Stage 1 and Stage 2 represent the naive and absorb parts of TyphoonMLA, and have sequence lengths of 4096 and 512, respectively.
  • Figure 5: HBM footprint comparison for DeepSeek-v3 in FP8 precision for both weights and KV-cache.
  • ...and 3 more figures