An Efficient Data Reuse with Tile-Based Adaptive Stationary for Transformer Accelerators

Tseng-Jen Li, Tian-Sheuan Chang

TL;DR

The paper tackles the external memory access (EMA) bottleneck in transformer accelerators by introducing Tile-based Adaptive Stationary (TAS), which selects between input-stationary and weight-stationary dataflows at tile granularity based on the input sequence length. A hybrid tile-level strategy adds output-stationary (OS) partial-sum reuse, so that TAS reduces both temporal and spatial data transfers. The approach yields substantial EMA reductions (often exceeding 97%) and corresponding energy savings across models such as Wav2Vec2.0-large and BERT-Base, while remaining compatible with existing attention optimizations. The result is a flexible, energy-efficient dataflow for large-scale transformers on accelerators.

Abstract

Transformer-based models have become the de facto backbone across many fields, such as computer vision and natural language processing. However, as these models scale in size, external memory access (EMA) for weights and activations becomes a critical bottleneck because it consumes significantly more energy than internal computation. While most prior work has focused on optimizing the self-attention mechanism, little attention has been given to optimizing data transfer during linear projections, where EMA costs are equally important. In this paper, we propose the Tile-based Adaptive Stationary (TAS) scheme, which selects between input-stationary and weight-stationary dataflow at tile granularity, based on the input sequence length. Our experimental results demonstrate that TAS can reduce EMA by more than 97% compared to traditional stationary schemes, while being compatible with various attention optimization techniques and hardware accelerators.
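
The trade-off between the two stationary schemes for a single linear projection Y = XW can be illustrated with a rough cost model. The sketch below is not the paper's formulation. It assumes square tiles, assumes that partial sums stay on chip, and counts only words moved to and from external memory. The names `ema_words` and `choose_stationary` are hypothetical, as are the example dimensions.

```python
from math import ceil


def ema_words(seq_len, d_in, d_out, tile, scheme):
    """Words moved to/from external memory for Y = X @ W (simplified model).

    X is (seq_len x d_in) and W is (d_in x d_out). Assumptions, not taken
    from the paper: square tiles of side `tile`, partial sums never spill
    off chip, and Y is written back exactly once.
    """
    x_words = seq_len * d_in
    w_words = d_in * d_out
    y_words = seq_len * d_out
    if scheme == "input":
        # Each X tile is loaded once and held; W is streamed past every
        # row block of X, so W is re-read once per row block.
        return x_words + ceil(seq_len / tile) * w_words + y_words
    if scheme == "weight":
        # Each W tile is loaded once and held; X is streamed past every
        # column block of W, so X is re-read once per column block.
        return w_words + ceil(d_out / tile) * x_words + y_words
    raise ValueError(f"unknown scheme: {scheme!r}")


def choose_stationary(seq_len, d_in, d_out, tile):
    """Return the cheaper scheme under the model above, plus both costs."""
    costs = {
        scheme: ema_words(seq_len, d_in, d_out, tile, scheme)
        for scheme in ("input", "weight")
    }
    return min(costs, key=costs.get), costs


if __name__ == "__main__":
    # Hypothetical 768 x 768 projection with 64 x 64 tiles.
    for seq_len in (128, 4096):
        scheme, costs = choose_stationary(seq_len, 768, 768, 64)
        print(seq_len, scheme, costs)
```

Under these assumptions, a short sequence favors keeping inputs stationary because the weights are re-read only a few times, while a long sequence favors keeping weights stationary. This is the kind of sequence-length-dependent choice the abstract describes, although the paper's actual selection rule and hybrid OS reuse may differ from this model.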

Paper Structure

This paper contains 12 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Matrix Mapping for Matrix-Matrix Multiplication with Conventional Stationary Schemes
  • Figure 2: Matrix Mapping for Matrix-Matrix Multiplication with Proposed Stationary Schemes