RedFuser: An Automatic Operator Fusion Framework for Cascaded Reductions on AI Accelerators

Xinsheng Tang, Yangcheng Li, Nan Wang, Zhiyi Shu, Xingyu Ling, Junna Xing, Peng Zhou, Qiang Liu

TL;DR

RedFuser is a framework that automatically identifies supported cascaded reduction patterns and generates optimized fused kernels, achieving up to 2× to 5× speedup over state-of-the-art AI compilers and matching the performance of highly optimized hand-written kernels.

Abstract

Operator fusion is a key performance optimization in the deployment of AI models: it significantly improves execution efficiency and has been widely adopted in modern AI compilers. However, for cascaded reduction operations involving multiple loops with inter-loop data dependencies, such as the safe softmax followed by GEMM within attention mechanisms, existing compilers lack effective automated fusion and kernel generation capabilities. Although some works have addressed specific instances through hand-crafted fusion strategies, their solutions are limited in generality and difficult to extend to other similar structures. Given the prevalence of such computational patterns in deep learning models, there remains significant untapped potential in general, automated fusion optimization. In this paper, we present a formal theoretical methodology for analyzing cascaded reductions that fuses them into a single loop and introduces an incremental computation form. Based on this methodology, we design Reduction Fuser (RedFuser), a framework that automatically identifies supported cascaded reduction patterns and generates optimized fused kernels. Experiments show that RedFuser successfully fuses diverse workloads, achieving up to 2$\times$ to 5$\times$ speedup over state-of-the-art AI compilers and matching the performance of highly optimized hand-written kernels. The code is available at https://github.com/alibaba/redfuser
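
To make the notion of a cascaded reduction concrete, the following minimal Python sketch computes the two dependent reductions of safe softmax (a max, then a sum of exponentials that depends on that max), first as two separate loops and then fused into a single loop. The fused version uses the well-known online-softmax rescaling identity purely as an illustration. The function names are hypothetical, finite inputs are assumed, and this is not the code RedFuser generates.

```python
import math

def softmax_stats_unfused(xs):
    """Two cascaded reductions: the sum depends on the max from the first loop."""
    m = max(xs)                              # reduction 1: global max
    d = sum(math.exp(x - m) for x in xs)     # reduction 2: needs m and re-reads xs
    return m, d

def softmax_stats_fused(xs):
    """Single pass: rescale the partial sum whenever the running max changes."""
    m = -math.inf   # running max
    d = 0.0         # running sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        # exp(m - m_new) corrects the partial sum accumulated under the old max
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, d
```

Both functions return the same `(max, denominator)` pair up to floating-point rounding, but the fused version reads each input element only once.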

Paper Structure (47 sections, 43 equations, 13 figures, 3 tables, 1 algorithm)

This paper contains 47 sections, 43 equations, 13 figures, 3 tables, 1 algorithm.

Figures (13)

  • Figure 1: The reduction tree structure in cascaded reductions
  • Figure 2: The formal definition of cascaded reductions
  • Figure 3: Comparison of computation and memory access patterns in the cascaded reductions before and after fusion. (a) Unfused cascaded reductions: each reduction re-loads the input data and accesses the results from prior reductions. (b) Fused cascaded reductions: the input is loaded only once, and memory accesses to results of preceding reductions are eliminated.
  • Figure 4: Comparison between non-incremental and incremental computation. (a) Non-incremental computation: the $k$-th level reduction must wait until all inputs are available before execution and on-chip memory consumption grows with the input length. (b) Incremental computation: the result at level $k$ can be updated immediately upon arrival of new input, maintaining constant on-chip memory footprint.
  • Figure 5: The normalized performance of fusing four selected modules on GPUs.
  • ...and 8 more figures
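
The contrast drawn in the Figure 4 caption can be sketched for the softmax-followed-by-GEMM pattern on a single output row. In this hedged Python illustration, the non-incremental version needs every score before the weighted sum can start, while the incremental version keeps only a running max, a running denominator, and one accumulator vector, so its state does not grow with the input length. All names are hypothetical, finite scores are assumed, and the update rule is the standard online-softmax rescaling, not necessarily the exact form derived in the paper.

```python
import math

def attention_row_reference(scores, values):
    """Non-incremental: all scores must be available before the weighted sum."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]    # buffer grows with input length
    d = sum(w)
    dim = len(values[0])
    return [sum(w[i] * values[i][j] for i in range(len(scores))) / d
            for j in range(dim)]

def attention_row_incremental(stream, dim):
    """Incremental: state is (m, d, acc), independent of input length."""
    m, d = -math.inf, 0.0
    acc = [0.0] * dim
    for s, v in stream:                      # one (score, value vector) pair at a time
        m_new = max(m, s)
        scale = math.exp(m - m_new)          # rescales results built on the old max
        p = math.exp(s - m_new)
        d = d * scale + p
        acc = [a * scale + p * vj for a, vj in zip(acc, v)]
        m = m_new
    return [a / d for a in acc]
```

Feeding the incremental version with `zip(scores, values)` should give the same row as the reference, up to rounding.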