Online Pseudo-average Shifting Attention(PASA) for Robust Low-precision LLM Inference: Algorithms and Numerical Analysis

Long Cheng, Qichen Liao, Fan Wu, Junlin Mu, Tengfei Han, Zhe Qiu, Lianqiang Li, Tianyi Liu, Fangzheng Miao, Keming Gao, Liang Wang, Zhen Zhang, Qiande Yin

TL;DR

This work tackles overflow and numerical instability in low-precision, long-sequence attention for large language and multi-modal models. It introduces PASA, an online algorithm built on Flash Attention (FA) that combines two techniques, online pseudo-average shifting and global recovering, to stay mathematically equivalent to FA while enabling half-precision computation. By formulating a shifting matrix $\mathbf{M}$ and tuning a hyper-parameter $\beta$ via a nonlinear condition, PASA suppresses the large attention scores that cause overflow and keeps accuracy close to high-precision baselines. Experiments on random benchmarks and real models (e.g., Qwen2-7B, Stable-Video-Diffusion IMG2VID) show that PASA prevents FP16 overflow, reduces RMSE relative to naive low-precision FA, and preserves output quality, enabling robust low-precision inference on NPUs and GPUs.

Abstract

Attention calculation is extremely time-consuming for long-sequence inference tasks, such as text or image/video generation, in large models. To accelerate this process, we developed a low-precision, mathematically equivalent algorithm called PASA, based on Flash Attention. PASA introduces two novel techniques: online pseudo-average shifting and global recovering. These techniques enable the use of half-precision computation throughout the Flash Attention process without incurring overflow instability or unacceptable loss of numerical accuracy. The algorithm improves performance on memory-restricted AI hardware architectures, such as the Ascend Neural-network Processing Unit (NPU), by reducing data movement and increasing computational FLOPs. The algorithm is validated using both designed random benchmarks and real large models. We find that the large bias and amplitude of attention input data are critical factors contributing to numerical overflow ($>65504$ for half precision) in two different categories of large models (Qwen2-7B language models and Stable-Video-Diffusion multi-modal models). Specifically, in the Stable-Video-Diffusion models, overflow arises from the large bias in the sequence dimension and from the resonance mechanism between the query and key in the head dimension. The resonance mechanism is defined as phase coincidence or a 180-degree phase shift between the query and key matrices; it markedly amplifies the element values of the attention score matrix. This issue also applies to the Qwen models. Additionally, numerical accuracy is assessed through root mean square error (RMSE) and by comparing the final generated texts and videos to those produced using high-precision attention.
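The overflow mechanism described above can be illustrated with a small NumPy sketch. This is not the authors' PASA algorithm: it uses a plain mean shift of the keys along the sequence dimension, synthetic inputs, and hypothetical helper names (`softmax`, `store_fp16`), and it models only the step of storing the score matrix in FP16. It relies on the fact that softmax over the keys is unchanged when the same constant is subtracted from every score in a row.

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)  # largest finite FP16 value


def softmax(x):
    """Row-wise softmax evaluated in float64 with max subtraction."""
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def store_fp16(x):
    """Round to FP16; values beyond the FP16 range become +/-inf."""
    with np.errstate(over="ignore"):
        return np.asarray(x).astype(np.float16)


rng = np.random.default_rng(0)
s_q, s_k, d = 16, 64, 128  # assumed toy sizes, not taken from the paper

# Synthetic inputs (assumption): a large common bias plus small fluctuations.
q = (30.0 + 0.01 * rng.standard_normal((s_q, d))).astype(np.float32)
k = (30.0 + 0.01 * rng.standard_normal((s_k, d))).astype(np.float32)

# High-precision reference attention probabilities.
p_ref = softmax(q.astype(np.float64) @ k.astype(np.float64).T)

# Naive path: each score is roughly d * 30 * 30 = 115200, beyond FP16_MAX.
s_naive = store_fp16(q @ k.T)

# Shifted path: subtract the mean key over the sequence dimension. This
# removes the same amount q_i . mean(k) from every score in row i, so the
# softmax is mathematically unchanged while the stored values shrink.
k_shift = k - k.mean(axis=0, keepdims=True)
s_shift = store_fp16(q @ k_shift.T)
p_shift = softmax(s_shift)

naive_overflows = bool(np.isinf(s_naive).any())
shift_is_finite = bool(np.isfinite(s_shift).all())
max_err = float(np.abs(p_shift - p_ref).max())
print(naive_overflows, shift_is_finite, max_err)
```

In this toy setting the bias alone drives the overflow; the sketch does not try to reproduce the resonance effect between query and key, nor the online block-wise processing and global recovering that the abstract attributes to PASA.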

Paper Structure

This paper contains 21 sections, 1 theorem, 25 equations, 14 figures, 4 tables, 1 algorithm.

Key Result

Theorem 2.1

Let $\mathbf{I} \in \mathbf{R}^{s \times s}$ denote the identity matrix and $\mathbf{J} \in \mathbf{R}^{s \times s}$ the all-ones matrix, and let $\lambda \ge 0$ be a parameter. The shifting matrix is defined as $\mathbf{M} = \mathbf{I} - \lambda \mathbf{J}$. The inverse of the shift ...
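The theorem statement is cut off here, so the paper's own expression for the inverse is not visible. As a standard linear-algebra fact that does not depend on the paper's wording, $\mathbf{J}^2 = s\mathbf{J}$ implies $(\mathbf{I} - \lambda \mathbf{J})^{-1} = \mathbf{I} + \frac{\lambda}{1 - \lambda s}\mathbf{J}$ whenever $\lambda s \neq 1$. The sketch below checks this numerically. The function names are hypothetical, and the choice $\lambda = \beta / s$ is an assumption made here so that $\mathbf{M}$ subtracts $\beta$ times the mean; the paper may parameterize it differently.

```python
import numpy as np


def shifting_matrix(s, lam):
    """M = I - lam * J, with J the s-by-s all-ones matrix."""
    return np.eye(s) - lam * np.ones((s, s))


def shifting_inverse(s, lam):
    """Closed-form inverse of M, derived from J @ J = s * J."""
    if np.isclose(lam * s, 1.0):
        raise ValueError("M is singular when lam * s == 1")
    return np.eye(s) + lam / (1.0 - lam * s) * np.ones((s, s))


s, beta = 8, 0.9
lam = beta / s  # assumption: ties lam to a fraction beta of the mean

M = shifting_matrix(s, lam)
M_inv = shifting_inverse(s, lam)

x = 100.0 + np.arange(s, dtype=np.float64)  # large bias, small variation
shifted = M @ x            # same as x - beta * x.mean()
recovered = M_inv @ shifted

print(np.abs(x).max(), np.abs(shifted).max())
```

With $\beta$ close to 1 the shifted vector keeps the variation of `x` but loses most of its bias, and multiplying by the inverse restores the original values. The case $\lambda s = 1$ (a full mean subtraction) is singular, which is why the helper rejects it.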

Figures (14)

  • Figure 1: The Precision Allocation in the Original FA
  • Figure 2: Partially Low Precision (FP16) Allocations in FA
  • Figure 3: Fully Low Precision (FP16) Allocations in FA
  • Figure 4: The Diagram Framework of PASA
  • Figure 5: The Diagram for the Reduction of both Average Value and Amplitude with PASA
  • ...and 9 more figures

Theorems & Definitions (1)

  • Theorem 2.1