TyphoonMLA: A Mixed Naive-Absorb MLA Kernel For Shared Prefix

Ahmet Caner Yüzügüler; Ahmet Çelik; Jiawei Zhuang; Lukas Cavigelli

TL;DR

TyphoonMLA targets the decode-time inefficiency of Multi-Head Latent Attention (MLA) by exploiting data reuse in shared KV-cache prefixes. It fuses the naive formulation (compute-efficient in shared regions) and the absorb formulation (memory-efficient in non-shared regions) into a single hybrid kernel, with prefill and decode stages that remain mathematically equivalent to standard MLA while reducing MACs and HBM traffic. A key result is the derivation of a batch-size threshold $B_\theta$ that governs when to favor the naive versus the absorb component, enabling consistent speedups. The threshold, $B_\theta = \frac{D_{qk}+D_v}{S_q(2D_l+D_r)} \cdot \frac{T}{M}$, captures the trade-off between compute and memory in shared-prefix scenarios. Experiments on NPUs and GPUs show speedups of up to 3.2× with a ~3% memory overhead and no accuracy loss, since the method is mathematically equivalent to standard MLA. The approach is compatible with existing optimization and distribution strategies and can be deployed at scale without retraining.
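The threshold formula above can be evaluated directly. The following is a minimal sketch, not the authors' code: the function name `batch_threshold` is hypothetical, and it assumes that $T$ denotes peak compute throughput and $M$ denotes memory bandwidth in units chosen so that $T/M$ is dimensionless, which this summary does not spell out. The dimension values in the usage lines are illustrative placeholders only.

```python
def batch_threshold(d_qk, d_v, s_q, d_l, d_r, t, m):
    """Evaluate B_theta = (D_qk + D_v) / (S_q * (2*D_l + D_r)) * (T / M).

    Assumption (not stated in the summary): t is peak compute throughput
    and m is memory bandwidth, in units that make t / m dimensionless.
    """
    if s_q <= 0 or m <= 0:
        raise ValueError("s_q and m must be positive")
    denominator = s_q * (2 * d_l + d_r)
    if denominator <= 0:
        raise ValueError("2*d_l + d_r must be positive")
    return (d_qk + d_v) / denominator * (t / m)


# Illustrative placeholder values, not measurements from the paper.
b_theta = batch_threshold(d_qk=192, d_v=128, s_q=1, d_l=512, d_r=64,
                          t=340.0, m=1.0)
print(f"B_theta = {b_theta:.1f}")
```

Because $S_q$ sits in the denominator, a longer query length lowers the threshold, while a higher compute-to-bandwidth ratio $T/M$ raises it.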

Abstract

Multi-Head Latent Attention (MLA) is a recent attention mechanism adopted in state-of-the-art LLMs such as DeepSeek-v3 and Kimi K2. Thanks to its novel formulation, MLA allows two functionally equivalent but computationally distinct kernel implementations: naive and absorb. While naive kernels (e.g., FlashAttention) are typically preferred in training and prefill for their computational efficiency, existing decoding kernels (e.g., FlashMLA) rely on the absorb method to minimize HBM bandwidth usage. However, because absorb implementations are compute-bound, they cannot benefit from data reuse opportunities in attention calculations, such as shared prefixes. In this work, we introduce TyphoonMLA, a hybrid approach that combines the naive and absorb formulations to harness the strengths of both. TyphoonMLA leverages the shared prefix by applying the naive formulation to the compute-bound parts of the attention calculation, while reducing the bandwidth requirements of the non-shared parts by using the absorb formulation. As a result, TyphoonMLA improves the throughput of attention calculations in MLA architectures by up to 3x on NPUs and 3.24x on GPUs, with only a 3% overhead in HBM size.
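The equivalence between the naive and absorb formulations can be checked numerically. The sketch below is a simplified illustration in NumPy, not the TyphoonMLA kernel: it omits the rotary (RoPE) component, uses a single decode query per head, and all names (`naive_attention`, `absorb_attention`, `w_uk`, `w_uv`, `c_kv`) are hypothetical. The naive path decompresses the latent cache into per-head keys and values; the absorb path folds the up-projection matrices into the query and the output, so attention runs directly on the latent cache.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def naive_attention(q, c_kv, w_uk, w_uv):
    # q: (H, D_qk), c_kv: (S, D_l), w_uk: (H, D_l, D_qk), w_uv: (H, D_l, D_v)
    k = np.einsum("sl,hld->hsd", c_kv, w_uk)  # decompressed keys   (H, S, D_qk)
    v = np.einsum("sl,hld->hsd", c_kv, w_uv)  # decompressed values (H, S, D_v)
    scores = np.einsum("hd,hsd->hs", q, k) / np.sqrt(q.shape[-1])
    return np.einsum("hs,hsd->hd", softmax(scores), v)


def absorb_attention(q, c_kv, w_uk, w_uv):
    # Fold w_uk into the query so scores are computed in latent space.
    q_lat = np.einsum("hd,hld->hl", q, w_uk)  # (H, D_l)
    scores = np.einsum("hl,sl->hs", q_lat, c_kv) / np.sqrt(q.shape[-1])
    o_lat = np.einsum("hs,sl->hl", softmax(scores), c_kv)  # (H, D_l)
    # Apply w_uv once to the latent output instead of to every cached token.
    return np.einsum("hl,hld->hd", o_lat, w_uv)


rng = np.random.default_rng(0)
H, S, D_l, D_qk, D_v = 4, 16, 32, 8, 8  # small illustrative sizes
q = rng.standard_normal((H, D_qk))
c_kv = rng.standard_normal((S, D_l))
w_uk = rng.standard_normal((H, D_l, D_qk))
w_uv = rng.standard_normal((H, D_l, D_v))

same = np.allclose(naive_attention(q, c_kv, w_uk, w_uv),
                   absorb_attention(q, c_kv, w_uk, w_uv))
print("formulations agree:", same)
```

The two functions differ only in the order of matrix products, which is why they return the same result while reading different amounts of data: the naive path materializes per-head keys and values, whereas the absorb path reads only the compressed latent cache.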
