AttentionEngine: A Versatile Framework for Efficient Attention Mechanisms on Diverse Hardware Platforms
Feiyang Chen, Yu Cheng, Lei Wang, Yuqing Xia, Ziming Miao, Lingxiao Ma, Fan Yang, Jilong Xue, Zhi Yang, Mao Yang, Haibo Chen
TL;DR
AttentionEngine optimizes attention computation across diverse hardware through a unified abstraction that decomposes attention into relevance scoring and aggregation. Programmable templates and a cross-backend scheduling framework automatically tailor kernel implementations to different attention variants and hardware backends, including NVIDIA and AMD GPUs. Attention is expressed in one of two patterns, Parallel or Recurrent, with customizable modification and row-wise normalization functions that map to efficient kernel templates and hardware-aware scheduling. Empirically, it delivers up to $10.4$-fold speedups on configurations beyond the reach of prior methods, improves both end-to-end inference and training, and supports a wide range of attention variants. The code is open source.
Abstract
Transformers and large language models (LLMs) have revolutionized machine learning, with attention mechanisms at the core of their success. As the landscape of attention variants expands, so too do the challenges of optimizing their performance, particularly across different hardware platforms. Current optimization strategies are often narrowly focused, requiring extensive manual intervention to accommodate changes in model configurations or hardware environments. In this paper, we introduce AttentionEngine, a comprehensive framework designed to streamline the optimization of attention mechanisms across heterogeneous hardware backends. By decomposing attention computation into modular operations with customizable components, AttentionEngine enables flexible adaptation to diverse algorithmic requirements. The framework further automates kernel optimization through a combination of programmable templates and a robust cross-platform scheduling strategy. Empirical results reveal performance gains of up to 10x on configurations beyond the reach of existing methods. AttentionEngine offers a scalable, efficient foundation for developing and deploying attention mechanisms with minimal manual tuning. Our code has been open-sourced and is available at https://github.com/microsoft/AttentionEngine.
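To make the decomposition concrete, the following is a minimal pure-Python sketch of the parallel pattern: relevance scoring, a customizable modification, row-wise normalization, and aggregation. It is an illustration only and not AttentionEngine's API; the names `attention`, `modify`, `normalize`, `causal_mask`, and `softmax_row` are hypothetical, and scaled dot-product scoring with softmax is assumed as the default variant.

```python
import math

def softmax_row(row):
    """Row-wise normalization (assumed default): numerically stable softmax."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(scores):
    """Example modification: hide keys positioned after the query."""
    return [[s if j <= i else -math.inf for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

def attention(q, k, v, modify=lambda s: s, normalize=softmax_row):
    """Hypothetical reference implementation of the decomposed attention."""
    scale = 1.0 / math.sqrt(len(q[0]))
    # 1. Relevance scoring: scaled dot product between each query and key.
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
              for qi in q]
    # 2. Modification: user-supplied transform of the score matrix.
    scores = modify(scores)
    # 3. Row-wise normalization: turn each score row into weights.
    weights = [normalize(row) for row in scores]
    # 4. Aggregation: weighted sum of the value vectors.
    dim = len(v[0])
    return [[sum(w * vj[d] for w, vj in zip(row, v)) for d in range(dim)]
            for row in weights]

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = attention(q, k, v, modify=causal_mask)
```

Swapping `modify` or `normalize` changes the attention variant without touching the scoring or aggregation steps, which is the kind of separation the framework's customizable components rely on. With the causal mask, the first query can only attend to the first key, so the first output row equals the first value vector.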
