AttentionEngine: A Versatile Framework for Efficient Attention Mechanisms on Diverse Hardware Platforms

Feiyang Chen, Yu Cheng, Lei Wang, Yuqing Xia, Ziming Miao, Lingxiao Ma, Fan Yang, Jilong Xue, Zhi Yang, Mao Yang, Haibo Chen

TL;DR

AttentionEngine tackles the challenge of optimizing attention computation across diverse hardware by introducing a unified abstraction that decomposes attention into relevance scoring and aggregation. It uses programmable templates and a cross-backend scheduling framework to automatically tailor kernel implementations to varying attention variants and hardware backends, including NVIDIA and AMD GPUs. The approach enables a two-pattern (Parallel and Recurrent) attention design with customizable modification and row-wise normalization functions, mapped to efficient kernel templates and hardware-aware scheduling. Empirically, it delivers up to $10.4$-fold speedups on configurations unreachable by prior methods and achieves significant improvements in both end-to-end inference and training, while supporting a broad range of attention variants and maintaining open-source accessibility for broad adoption.

Abstract

Transformers and large language models (LLMs) have revolutionized machine learning, with attention mechanisms at the core of their success. As the landscape of attention variants expands, so too do the challenges of optimizing their performance, particularly across different hardware platforms. Current optimization strategies are often narrowly focused, requiring extensive manual intervention to accommodate changes in model configurations or hardware environments. In this paper, we introduce AttentionEngine, a comprehensive framework designed to streamline the optimization of attention mechanisms across heterogeneous hardware backends. By decomposing attention computation into modular operations with customizable components, AttentionEngine enables flexible adaptation to diverse algorithmic requirements. The framework further automates kernel optimization through a combination of programmable templates and a robust cross-platform scheduling strategy. Empirical results reveal performance gains of up to 10x on configurations beyond the reach of existing methods. AttentionEngine offers a scalable, efficient foundation for developing and deploying attention mechanisms with minimal manual tuning. Our code has been open-sourced and is available at https://github.com/microsoft/AttentionEngine.

Paper Structure

This paper contains 22 sections, 14 figures, 2 tables, 1 algorithm.

Figures (14)

  • Figure 1: The foundational attention mechanism and its variants. The attention mechanism is divided into stages such as embedding, interaction, normalization, and composition (left). Attention variants change these stages in various ways (right). For example, Causal Attention modifies the interaction stage by applying a mask, which changes the computation flow.
  • Figure 2: The performance of attention implementations.
  • Figure 3: System overview: AttentionEngine begins with attention templates in the Programming Interface, which are used to define Custom Attention. These templates are then lowered to kernel templates and automatically scheduled to generate the best execution plan on the device.
  • Figure 4: On the left is AttentionEngine’s unified attention template. Instantiating this template produces two distinct patterns (Parallel Pattern and Recurrent Pattern). The red box highlights the operations corresponding to the core components of the attention mechanism in the unified attention template: relevance_scoring and aggregate. Both the customizable_function and the mod function are user-defined. The customizable_function encompasses both a modification function and a row-wise normalization function, whereas the mod function is restricted to a modification function only.
  • Figure 5: Customizable functions in the programming interface
  • ...and 9 more figures
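
The decomposition described in the Figure 4 caption can be illustrated with a minimal pure-Python sketch. It is not AttentionEngine's API: the function names `causal_mod`, `softmax_rows`, `parallel_attention`, and `recurrent_attention` are hypothetical, and only `relevance_scoring` and `aggregate` are borrowed from the caption as labels. The recurrent variant assumes a linear-attention-style running state with the mod function applied to each key, which is one plausible reading of the Recurrent Pattern rather than the paper's definition.

```python
import math

def relevance_scoring(q, k):
    """Dot-product scores: one row per query, one column per key."""
    return [[sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]

def causal_mod(scores):
    """Hypothetical modification function: mask keys after the query position."""
    return [[s if j <= i else -math.inf for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

def softmax_rows(scores):
    """Hypothetical row-wise normalization: numerically stable softmax per row."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def aggregate(weights, v):
    """Weighted sum of value rows for each query."""
    dim = len(v[0])
    return [[sum(w * vj[d] for w, vj in zip(row, v)) for d in range(dim)]
            for row in weights]

def parallel_attention(q, k, v, mod=causal_mod, normalize=softmax_rows):
    """Parallel pattern: score all pairs, modify, normalize rows, aggregate."""
    return aggregate(normalize(mod(relevance_scoring(q, k))), v)

def recurrent_attention(q, k, v, mod=lambda key: key):
    """Recurrent pattern (assumed form): a d_k x d_v state is updated per
    position; only a modification function is applied, with no row-wise
    normalization."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for qt, kt, vt in zip(q, k, v):
        kt = mod(kt)
        for a in range(d_k):
            for b in range(d_v):
                state[a][b] += kt[a] * vt[b]  # accumulate outer product k_t^T v_t
        outputs.append([sum(qt[a] * state[a][b] for a in range(d_k))
                        for b in range(d_v)])
    return outputs

# Tiny example: two positions, two-dimensional queries, keys, and values.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
parallel_out = parallel_attention(q, k, v)
recurrent_out = recurrent_attention(q, k, v)
```

In this sketch, swapping `causal_mod` or `softmax_rows` for another function changes the attention variant without touching `relevance_scoring` or `aggregate`, which is the kind of separation the unified template is described as providing.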