AMLA: MUL by ADD in FlashAttention Rescaling

Qichen Liao, Chengqiu Hu, Fangzheng Miao, Bao Li, Yiyang Liu, Junlong Lyu, Lirui Jiang, Jun Wang, Lingchao Zheng, Jun Li, Yuwei Fan

TL;DR

This work tackles the decode‑phase bottlenecks of Multi‑Head Latent Attention (MLA) by introducing Ascend MLA (AMLA), a kernel co‑designed with Huawei Ascend NPUs. AMLA replaces the FP32 multiplications in output rescaling with integer additions by reinterpreting FP32 bit patterns as INT32 (F × 2^n = AS_FP32(AS_INT32(F) + n × 2^{23})), and it updates the output in place in global memory (GM) through AtomicAdd, which eliminates the movement of large intermediate tensors. AMLA also introduces a Preload Pipeline and hierarchical tiling to overlap Cube and Vector work and maximize FLOPS utilization (FU). It reaches up to 86.8% FU on Ascend 910, compared with up to 66.7% for FlashMLA on NVIDIA H800 SXM5. The approach is reported to be numerically stable, and the kernel has been integrated into Huawei’s CANN with a public release planned.

Abstract

Multi-head Latent Attention (MLA) significantly reduces KVCache memory usage in Large Language Models while introducing substantial computational overhead and intermediate variable expansion. This poses challenges for efficient hardware implementation -- especially during the decode phase. This paper introduces Ascend MLA (AMLA), a high-performance kernel specifically optimized for Huawei's Ascend NPUs. AMLA is built on two core innovations: (1) A novel FlashAttention-based algorithm that replaces floating-point multiplications with integer additions for output block rescaling, leveraging binary correspondence between FP32 and INT32 representations; (2) A Preload Pipeline strategy with hierarchical tiling that maximizes FLOPS utilization: the Preload Pipeline achieves Cube-bound performance, while hierarchical tiling overlaps data movement and computation within the Cube core. Experiments show that on Ascend 910 NPUs (integrated in CloudMatrix384), AMLA achieves up to 614 TFLOPS, reaching 86.8% of the theoretical maximum FLOPS, outperforming the state-of-the-art open-source FlashMLA implementation, whose FLOPS utilization is up to 66.7% on NVIDIA H800 SXM5. The AMLA kernel has been integrated into Huawei's CANN and will be released soon.
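To make the "output block rescaling" step concrete, the following is a minimal pure-Python sketch of the standard FlashAttention online-softmax recurrence for a single query vector. It is not the AMLA kernel: the function names, the `block` parameter, and the list-based layout are illustrative assumptions. It only shows where the floating-point multiplication `o * scale` arises, which is the operation AMLA replaces with integer additions.

```python
import math

def attention_reference(q, keys, values):
    """Plain softmax attention for one query vector, computed in one pass."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom
            for d in range(dim)]

def attention_blockwise(q, keys, values, block=2):
    """FlashAttention-style online softmax over KV blocks (illustrative only)."""
    dim = len(values[0])
    m = -math.inf       # running maximum of the scores seen so far
    l = 0.0             # running softmax denominator
    out = [0.0] * dim   # running unnormalized output
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        m_new = max(m, max(scores))
        # Rescale factor for everything accumulated under the old maximum.
        # On the first block m is -inf, so scale is exp(-inf) = 0.0.
        scale = math.exp(m - m_new)
        p = [math.exp(s - m_new) for s in scores]
        l = l * scale + sum(p)
        # The multiplication o * scale is the output rescaling step.
        out = [o * scale + sum(pj * v[d] for pj, v in zip(p, v_blk))
               for d, o in enumerate(out)]
        m = m_new
    return [o / l for o in out]
```

Both functions should agree up to floating-point rounding for any block size; the blockwise version simply never materializes the full score row.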

Paper Structure

This paper contains 36 sections, 5 theorems, 36 equations, 11 figures, 5 tables, 2 algorithms.

Key Result

Lemma 3.1

Given an FP32 number $F$, let $I=AS\_INT32(F)$ denote the integer represented by its binary pattern. Suppose $0 < E < 255$ is the unsigned integer value represented by $F$'s exponent bits. Then, for any integer $n \in \mathbb{Z}$ satisfying $-E < n < 255 - E$, the result of the multiplication $F \times 2^n$ is given by $F \times 2^n = AS\_FP32(I + n \times 2^{23})$.
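The identity quoted in the TL;DR, F × 2^n = AS_FP32(AS_INT32(F) + n × 2^{23}), can be checked on a CPU with a short Python sketch. The helper names `as_int32`, `as_fp32`, and `mul_pow2_by_add` are hypothetical and use the standard `struct` module to reinterpret bits; this is an illustration of the arithmetic only, not of how the Ascend kernel is implemented.

```python
import struct

def as_int32(f):
    """Reinterpret the FP32 bit pattern of f as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<f", f))[0]

def as_fp32(i):
    """Reinterpret a signed 32-bit integer bit pattern as an FP32 value."""
    return struct.unpack("<f", struct.pack("<i", i))[0]

def mul_pow2_by_add(f, n):
    """Compute f * 2**n by adding n * 2**23 to the integer view of f."""
    i = as_int32(f)
    e = (i >> 23) & 0xFF  # biased exponent field (8 bits)
    # Conditions from the lemma: f is a normal number and the shifted
    # exponent stays strictly inside (0, 255).
    if not (0 < e < 255 and -e < n < 255 - e):
        raise ValueError("exponent would leave the normal FP32 range")
    return as_fp32(i + n * (1 << 23))

# Figure 3: the bit pattern of 0.5 read as INT32 is 126 * 2**23.
assert as_int32(0.5) == 126 * 2**23
# Multiplying by a power of two becomes a single integer addition.
assert mul_pow2_by_add(0.5, 3) == 4.0
assert mul_pow2_by_add(-1.5, -2) == -0.375
```

The range check mirrors the lemma's hypotheses: outside them the addition would carry into the sign bit or produce a zero, subnormal, infinity, or NaN encoding, so the sketch raises instead of returning a wrong value.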

Figures (11)

  • Figure 1: Roofline Analysis of BF16 Decoding on Ascend 910. The dashed segment indicates the region where performance is limited by memory bandwidth, whereas the horizontal solid lines represent the peak compute-bound performance achievable. Data points corresponding to different attention variants are plotted, showcasing their operational regimes and proximity to the hardware limits.
  • Figure 2: Da Vinci V220 architecture. Cache capacities are accessible via the Ascend C API GetCoreMemSize.
  • Figure 3: The bit pattern 00111111000000000000000000000000 is 0.5 when interpreted as an FP32 value, and $126 \times 2^{23}$ when interpreted as an INT32 number.
  • Figure 4: Base vs. Ascend MLA.
  • Figure 5: Two-phase pipeline: (1) Preload resolves initial dependencies; (2) Steady Loop executes Cycles with maximal concurrency.
  • ...and 6 more figures

Theorems & Definitions (10)

  • Example 3.1
  • Lemma 3.1
  • proof
  • Remark 3.1
  • Remark 3.2
  • Theorem 4.1
  • Remark 4.1
  • Lemma B.1
  • Lemma B.2
  • Theorem B.1