DAPA: Distribution Aware Piecewise Activation Functions for On-Device Transformer Inference and Training

Maoyang Xiang, Bo Wang

Abstract

Non-linear activation functions play a pivotal role in on-device inference and training: they consume substantial hardware resources and significantly affect system performance and energy efficiency. In this work, we propose Distribution-Aware Piecewise Activation (DAPA), a differentiable and hardware-friendly activation function for Transformer architectures that exploits the distribution of pre-activation data. DAPA employs a non-uniform piecewise approximation that allocates finer segments to high-probability regions of the distribution, improving generalizability over prior piecewise linear methods. The resulting approximation is further quantized using Distribution-Weighted Mean Square Error (DWMSE) to reduce latency and resource utilization for hardware deployment. Our HLS implementation demonstrates that DAPA speeds up GELU computation by 16$\times$ and reduces DSP utilization by 16$\times$ while maintaining comparable or better performance across vision Transformers and GPT-2 models.
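To make the idea of allocating finer segments to high-probability regions concrete, the following is a minimal sketch and not the paper's implementation. It assumes a standard normal stand-in for the pre-activation distribution, places breakpoints at equally spaced sample quantiles, interpolates linearly between exact GELU values, and scores the fit with a mean squared error averaged over the samples. The function names (`quantile_knots`, `piecewise_gelu`, `sample_weighted_mse`) are hypothetical, and the paper's own DWMSE definition and quantization step are not reproduced here.

```python
import math
import numpy as np


def gelu(x):
    """Exact GELU, x * Phi(x), evaluated with the error function."""
    x = np.asarray(x, dtype=np.float64)
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))


def quantile_knots(samples, n_segments):
    """Breakpoints at equally spaced quantiles of the observed samples.

    Each segment then covers roughly the same probability mass, so dense
    regions of the distribution receive narrower segments.
    """
    probs = np.linspace(0.0, 1.0, n_segments + 1)
    return np.quantile(samples, probs)


def piecewise_gelu(x, knots):
    """Linear interpolation between exact GELU values at the knots.

    np.interp holds the end values constant outside [knots[0], knots[-1]].
    """
    return np.interp(x, knots, gelu(knots))


def sample_weighted_mse(samples, knots):
    """Squared error averaged over the samples, so frequent inputs dominate.

    This is an illustrative stand-in for a distribution-weighted error,
    not the paper's DWMSE.
    """
    err = piecewise_gelu(samples, knots) - gelu(samples)
    return float(np.mean(err ** 2))


# Assumption: pre-activations are approximated by a standard normal sample.
rng = np.random.default_rng(0)
samples = rng.normal(size=10_000)

knots = quantile_knots(samples, 16)
widths = np.diff(knots)
print("narrowest / widest segment:", widths.min(), widths.max())
print("sample-weighted MSE:", sample_weighted_mse(samples, knots))
```

In this sketch the segments near zero, where most samples fall, are much narrower than the segments in the tails. Pure quantile spacing is only one way to obtain a non-uniform partition; it is used here because it is the simplest to state.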

Paper Structure (11 sections, 7 equations, 6 figures, 4 tables, 1 algorithm)


Figures (6)

  • Figure 1: The relationship between MSE/DWMSE and the difference in Top-1 ImageNet-1K classification accuracy relative to the FP32 baseline, evaluated across three Vision Transformer variants.
  • Figure 2: The relationship between MSE/DWMSE and the difference in WikiText2 PPL relative to the FP32 baseline, evaluated on GPT2.
  • Figure 3: The DAPA method approximates the GELU function and its derivative by partitioning the probability density function (PDF) into $N$ quantile regions.
  • Figure 4: DAPA functions generated from distributions with different numbers of images are evaluated on the ViT-Small.
  • Figure 5: Training loss curves for three ViT variants comparing standard PyTorch GELU activation with the DAPA(16) approximation function across epochs.
  • ...and 1 more figures