Can Asymmetric Tile Buffering Be Beneficial?

Chengyue Wang, Wesley Pang, Xinrui Wu, Gregory Jun, Luis Romero, Endri Taka, Diana Marculescu, Tony Nowatzki, Pranathi Vasireddy, Joseph Melber, Deming Chen, Jason Cong

TL;DR

This work introduces asymmetric tile buffering (ATB) for GEMM, decoupling the buffered dimensions of A and C to raise arithmetic intensity while reducing input-buffer pressure. It develops a coupled analytical model of arithmetic intensity and instruction-level parallelism to guide tiling and microkernel optimization, including double buffering and input-sharing strategies. Empirical evaluation on AMD XDNA2 AIE demonstrates up to 4.54× speedup over symmetric buffering and substantial gains over state-of-the-art baselines across mixed-precision configurations, including BF16/BFP16. The findings show ATB can shift performance from memory-bound toward compute-bound regimes and establish practical guidelines for tiling, kernel design, and mixed-precision GEMM on reconfigurable NPUs.

Abstract

General matrix multiplication (GEMM) is the computational backbone of modern AI workloads, and its efficiency is critically dependent on effective tiling strategies. Conventional approaches employ symmetric tile buffering, where the buffered tile size of the input $A$ along the dimension $M$ matches the output tile size of $C$. In this paper, we introduce asymmetric tile buffering (ATB), a simple but powerful technique that decouples the buffered tile dimensions of the input and output operands. We show, for the first time, that ATB is both practical and highly beneficial. To explain this effect, we develop a performance model that incorporates both the benefits of ATB (higher arithmetic intensity) and its overheads (higher kernel switching costs), providing insight into how to select effective ATB tiling factors. As a case study, we apply ATB to AMD's latest XDNA2 AI Engine (AIE), achieving up to a 4.54× speedup, from 4.8 to 24.6 TFLOPS on mixed-precision BFP16–BF16 GEMM, establishing a new performance record for XDNA2 AIE.
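
The trade-off described in the abstract can be illustrated with a rough back-of-the-envelope sketch. This is not the paper's performance model. It assumes an output-stationary schedule in which a $T_{M_C} \times T_N$ tile of $C$ stays resident, inputs are double-buffered, and footprint is counted in elements rather than bytes (ignoring the different widths of BFP16, BF16, and accumulator data). The function names, the tile sizes, and the use of $T_{M_C} / T_{M_A}$ as a proxy for kernel switching are all assumptions made for illustration.

```python
def arithmetic_intensity(t_mc: int, t_n: int) -> float:
    """FLOPs per streamed input element for a resident t_mc x t_n C tile.

    Each step along K performs 2 * t_mc * t_n FLOPs (multiply and add)
    and consumes t_mc elements of A plus t_n elements of B.
    """
    return 2 * t_mc * t_n / (t_mc + t_n)


def buffer_elements(t_ma: int, t_mc: int, t_k: int, t_n: int) -> int:
    """Local-buffer footprint in elements (assumed accounting, not from the paper)."""
    a_buf = 2 * t_ma * t_k  # double-buffered A tile, only t_ma rows at a time
    b_buf = 2 * t_k * t_n   # double-buffered B tile
    c_buf = t_mc * t_n      # resident C tile
    return a_buf + b_buf + c_buf


def kernel_calls_per_b_tile(t_ma: int, t_mc: int) -> int:
    """Assumed proxy for switching overhead: A sub-tiles consumed per B tile."""
    return t_mc // t_ma


# Hypothetical tile sizes chosen only to make the arithmetic easy to follow.
symmetric = dict(t_ma=64, t_mc=64, t_k=64, t_n=64)
asymmetric = dict(t_ma=16, t_mc=96, t_k=64, t_n=64)

for name, t in (("symmetric", symmetric), ("asymmetric", asymmetric)):
    print(
        name,
        buffer_elements(**t),
        round(arithmetic_intensity(t["t_mc"], t["t_n"]), 1),
        kernel_calls_per_b_tile(t["t_ma"], t["t_mc"]),
    )
```

Under these assumptions, the asymmetric configuration uses fewer buffered elements (16384 versus 20480) while reaching a higher arithmetic intensity (76.8 versus 64.0 FLOPs per input element), at the price of six kernel invocations per buffered $B$ tile instead of one. That is the qualitative tension between arithmetic intensity and switching cost that the abstract refers to.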

Paper Structure

This paper contains 24 sections, 13 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Architecture and Roofline: AMD AIE XDNA2™ spatial architecture (left) and the roofline view illustrating the performance impact of kernel and tiling strategies (right).
  • Figure 2: Comparison of symmetric (left) and asymmetric (right) tile buffering. Asymmetric tiles ($T_{M_A}$, $T_{M_C}$, $T_K$, $T_N$) increase arithmetic intensity but also switching overhead.
  • Figure 3: Example schedule of an accumulation chain, assuming a 3-cycle latency per instruction, where loading one input takes 1 instruction and an accumulator load/store takes 2.
  • Figure 4: Comparison of single-buffered (left) vs. double-buffered (right) input register allocation.
  • Figure 5: GEMM microkernel optimizations: top-left, input sharing across a 2$\times$2 chain cluster; bottom-left, prolog–epilog overlap across clusters; right, corresponding code.