FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs

Haoran Lin, Xianzhi Yu, Kang Zhao, Lu Hou, Zongyuan Zhan, Stanislav Kamenev, Han Bao, Ting Hu, Mingkai Wang, Qixin Chang, Siyue Sui, Weihao Sun, Jiaxin Hu, Jun Yao, Zekun Yin, Cheng Qian, Ying Zhang, Yinfei Pan, Yu Yang, Weiguo Liu

TL;DR

This work proposes FastAttention, which pioneers the adaptation of the FlashAttention series to NPUs and low-resource GPUs to boost LLM inference efficiency. It migrates the FlashAttention series to Ascend NPUs with three strategies: a novel two-level tiling strategy for runtime speedup, a tiling-mask strategy for memory saving, and a tiling-AllReduce strategy for reducing communication overhead.

Abstract

The FlashAttention series has been widely applied in the inference of large language models (LLMs). However, it only supports high-end GPU architectures, e.g., Ampere and Hopper, and at present it is not easily transferable to NPUs and low-resource GPUs. Moreover, it is inefficient in multi-NPU or multi-GPU inference scenarios. In this work, we propose FastAttention, which pioneers the adaptation of the FlashAttention series to NPUs and low-resource GPUs to boost LLM inference efficiency. Specifically, we take Ascend NPUs and Volta-based GPUs as representatives for designing FastAttention. We migrate the FlashAttention series to Ascend NPUs by proposing a novel two-level tiling strategy for runtime speedup, a tiling-mask strategy for memory saving, and a tiling-AllReduce strategy for reducing communication overhead. In addition, we adapt FlashAttention to Volta-based GPUs by redesigning the operand layout in shared memory and introducing a simple yet effective CPU-GPU cooperative strategy for efficient memory utilization. On Ascend NPUs, FastAttention achieves a 10.7$\times$ speedup compared to the standard attention implementation, and Llama-7B with FastAttention reaches up to 5.16$\times$ higher throughput than with standard attention. On Volta-architecture GPUs, FastAttention yields a 1.43$\times$ speedup compared to its equivalents in xformers, and Pangu-38B with FastAttention achieves a 1.46$\times$ end-to-end speedup using FasterTransformer. Coupled with the proposed CPU-GPU cooperative strategy, FastAttention supports a maximal input length of 256K on 8 V100 GPUs. All code will be made available soon.
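For readers unfamiliar with the tiling idea that the FlashAttention series builds on, the following is a minimal sketch of blockwise attention with an online softmax, checked against a standard implementation. It is a generic, single-level illustration in pure Python and is not the two-level tiling strategy or any NPU or GPU kernel from this paper. The function names, the `block_size` parameter, and the list-of-lists layout are assumptions made for illustration only.

```python
import math

def standard_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v, with a full score row per query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(len(v[0]))])
    return out

def tiled_attention(q, k, v, block_size=2):
    """Same result, but keys/values are visited one tile at a time.

    Only a running max, a running softmax denominator, and an output
    accumulator are kept per query, so the full score row is never stored.
    block_size is a hypothetical tuning knob for this sketch.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    dv = len(v[0])
    out = []
    for qi in q:
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * dv
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new max (0.0 on the first tile).
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [a * correction + sum(w * vj[c] for w, vj in zip(weights, v_blk))
                   for c, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Calling `tiled_attention(q, k, v, block_size)` with any positive `block_size` should match `standard_attention(q, k, v)` up to floating-point rounding; the block size only changes how much intermediate state is live at once.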

Paper Structure

This paper contains 24 sections, 5 equations, 17 figures, 9 tables.

Figures (17)

  • Figure 1: The comparison of architectures between resource-rich GPUs and Ascend NPUs
  • Figure 2: a) The unified tiling scheme with the fine-grained pipeline of Vector and Cube units; b) The two-level tiling strategy that employs the larger block size in the first level and maintains the smaller block size in the second level.
  • Figure 3: In the case $b=3, M=3$, an M-mask matrix can be split into 6 B-mask matrices required by any given block through shifting.
  • Figure 4: The pipeline of the FastAttention with different block sizes.
  • Figure 5: An example of MMA instruction m8n8k4 for Volta.
  • ...and 12 more figures