An Efficient Data Reuse with Tile-Based Adaptive Stationary for Transformer Accelerators
Tseng-Jen Li, Tian-Sheuan Chang
TL;DR
The paper tackles the external memory access (EMA) bottleneck in transformer accelerators by introducing Tile-based Adaptive Stationary (TAS), which selects between input-stationary and weight-stationary dataflows at tile granularity based on the input sequence length. Combined with a hybrid tile-level strategy that uses output-stationary (OS) partial-sum reuse, TAS minimizes both temporal and spatial data transfers. The approach yields substantial EMA reductions (often exceeding 97%) and corresponding energy savings on models such as Wav2Vec2.0-large and BERT-Base, while remaining compatible with existing attention optimizations. The result is a flexible, energy-efficient dataflow for deploying large-scale transformers on accelerators.
Abstract
Transformer-based models have become the de facto backbone across many fields, such as computer vision and natural language processing. However, as these models scale in size, external memory access (EMA) for weights and activations becomes a critical bottleneck because it consumes significantly more energy than on-chip computation. While most prior work has focused on optimizing the self-attention mechanism, little attention has been given to optimizing data transfer during linear projections, where EMA costs are equally important. In this paper, we propose the Tile-based Adaptive Stationary (TAS) scheme, which selects the input-stationary or weight-stationary dataflow at tile granularity based on the input sequence length. Our experimental results demonstrate that TAS can reduce EMA by more than 97% compared to traditional stationary schemes, while being compatible with various attention optimization techniques and hardware accelerators.
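
To make the idea of choosing a stationary scheme from the sequence length concrete, the following is a minimal sketch of a toy EMA cost model for one linear projection (an N x D_in input times a D_in x D_out weight). It is not the cost model or selection rule from the paper: the function names, the single `tile` parameter, and the assumption that the non-stationary operand is re-fetched from external memory once per stationary tile are all illustrative assumptions, and output traffic is ignored because it is the same in both cases.

```python
from math import ceil


def ema_input_stationary(seq_len, d_in, d_out, tile):
    """Toy EMA (in words) when a tile of `tile` input rows is held on chip.

    Assumption: each input is fetched once, and the full weight matrix is
    re-streamed from external memory for every input tile.
    """
    input_reads = seq_len * d_in
    weight_reads = ceil(seq_len / tile) * d_in * d_out
    return input_reads + weight_reads


def ema_weight_stationary(seq_len, d_in, d_out, tile):
    """Toy EMA (in words) when a tile of `tile` weight columns is held on chip.

    Assumption: each weight is fetched once, and the full input matrix is
    re-streamed from external memory for every weight tile.
    """
    weight_reads = d_in * d_out
    input_reads = ceil(d_out / tile) * seq_len * d_in
    return input_reads + weight_reads


def choose_stationary(seq_len, d_in, d_out, tile):
    """Return the cheaper scheme under this toy model and its EMA count."""
    is_cost = ema_input_stationary(seq_len, d_in, d_out, tile)
    ws_cost = ema_weight_stationary(seq_len, d_in, d_out, tile)
    return ("IS", is_cost) if is_cost <= ws_cost else ("WS", ws_cost)


if __name__ == "__main__":
    # Hypothetical 768 x 768 projection with a 64-wide tile.
    for n in (16, 256, 4096):
        print(n, choose_stationary(n, 768, 768, 64))
```

Under these assumptions, the cheaper scheme flips as the sequence length grows relative to the weight dimensions, which is the kind of dependence that motivates selecting the stationary scheme adaptively rather than fixing it in advance.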
