DAPA: Distribution Aware Piecewise Activation Functions for On-Device Transformer Inference and Training

Maoyang Xiang; Bo Wang

Abstract

Non-linear activation functions play a pivotal role in on-device inference and training: they consume substantial hardware resources and significantly affect system performance and energy efficiency. In this work, we propose Distribution-Aware Piecewise Activation (DAPA), a differentiable and hardware-friendly activation function for Transformer architectures that exploits the distribution of pre-activation data. DAPA employs a non-uniform piecewise approximation that allocates finer segments to high-probability regions of the distribution, improving generalizability over prior piecewise linear methods. The resulting approximation is then quantized using Distribution-Weighted Mean Squared Error (DWMSE) to reduce latency and resource utilization in hardware deployment. Our HLS implementation demonstrates that DAPA speeds up GELU computation by 16$\times$ and reduces DSP utilization by 16$\times$ while maintaining comparable or better performance across vision Transformers and GPT-2 models.
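The idea of allocating finer segments to high-probability regions can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian stand-in for pre-activation data, the helper names (`quantile_breakpoints`, `piecewise_linear`, `sample_weighted_mse`), and the use of a plain sample-averaged squared error as a proxy for a distribution-weighted error are all assumptions made here for illustration. The sketch places breakpoints at empirical quantiles, interpolates GELU linearly between them, and compares the result with a uniform split over the same range.

```python
import bisect
import math
import random


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def quantile_breakpoints(samples, n_segments):
    """Breakpoints at empirical quantiles, so each segment holds roughly equal sample mass."""
    ordered = sorted(samples)
    last = len(ordered) - 1
    points = [ordered[round(i * last / n_segments)] for i in range(n_segments + 1)]
    # Drop duplicates so every segment has non-zero width.
    return sorted(set(points))


def piecewise_linear(x, breakpoints, values):
    """Linear interpolation between breakpoints; the end segments are extended outward."""
    i = bisect.bisect_right(breakpoints, x) - 1
    i = min(max(i, 0), len(breakpoints) - 2)
    x0, x1 = breakpoints[i], breakpoints[i + 1]
    slope = (values[i + 1] - values[i]) / (x1 - x0)
    return values[i] + slope * (x - x0)


def sample_weighted_mse(samples, approx, exact):
    """Squared error averaged over observed samples (an assumed proxy, not the paper's DWMSE)."""
    return sum((approx(x) - exact(x)) ** 2 for x in samples) / len(samples)


random.seed(0)
# Hypothetical pre-activation data: a Gaussian stand-in, not measured from a real model.
samples = [random.gauss(-0.5, 1.0) for _ in range(20000)]
n_segments = 16

# Non-uniform split: narrow segments where samples are dense, wide ones in the tails.
q_bp = quantile_breakpoints(samples, n_segments)
q_vals = [gelu(b) for b in q_bp]

# Uniform split over the same range, for comparison.
lo, hi = q_bp[0], q_bp[-1]
u_bp = [lo + (hi - lo) * i / n_segments for i in range(n_segments + 1)]
u_vals = [gelu(b) for b in u_bp]

err_q = sample_weighted_mse(samples, lambda x: piecewise_linear(x, q_bp, q_vals), gelu)
err_u = sample_weighted_mse(samples, lambda x: piecewise_linear(x, u_bp, u_vals), gelu)
print(f"quantile split: {err_q:.3e}   uniform split: {err_u:.3e}")
```

Printing both errors lets a reader compare the two segmentations on the same samples. A hardware-oriented version would also store per-segment slopes and intercepts in fixed point, which this sketch leaves out.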

Paper Structure (11 sections, 7 equations, 6 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Methodology of DAPA
Distribution-Weighted Mean Squared Error
Design of Distribution-Aware Piecewise Activation
Experimental Results
Impact of Number of Input Samples
Performance on Image Classification
Performance on Natural Language Processing Tasks
Training Transformers with DAPA
Hardware Implementation

Figures (6)

Figure 1: The relationship between MSE/DWMSE and the difference in Top-1 ImageNet-1K classification accuracy relative to the FP32 baseline, evaluated across three Vision Transformer variants.
Figure 2: The relationship between MSE/DWMSE and the difference in WikiText2 PPL relative to the FP32 baseline, evaluated on GPT-2.
Figure 3: The DAPA method approximates the GELU function and its derivative by partitioning the probability density function (PDF) into $N$ quantile regions.
Figure 4: DAPA functions generated from distributions built with different numbers of images, evaluated on ViT-Small.
Figure 5: Training loss curves for three ViT variants comparing standard PyTorch GELU activation with the DAPA(16) approximation function across epochs.
...and 1 more figure
