Genetic Quantization-Aware Approximation for Non-Linear Operations in Transformers

Pingcheng Dong, Yonghao Tan, Dong Zhang, Tianwei Ni, Xuejiao Liu, Yu Liu, Peng Luo, Luhong Liang, Shih-Yang Liu, Xijie Huang, Huaiyu Zhu, Yun Pan, Fengwei An, Kwang-Ting Cheng

TL;DR

The paper tackles the hardware cost of non-linear operations in Transformer models on edge devices by introducing GQA-LUT, a quantization-aware genetic LUT-Approximation algorithm that enables integer-only computation for piece-wise linear (pwl) approximations. GQA-LUT uses a genetic algorithm to optimize the breakpoints of the piece-wise linear functions and adds a rounding mutation (RM) to alleviate breakpoint deviation when the scaling factor is large. The scaling factor is constrained to a power of two, $S=2^{\left\lfloor \log_2 \alpha \right\rceil}$, where $\lfloor\cdot\rceil$ denotes rounding to the nearest integer, to simplify hardware. GQA-LUT achieves negligible degradation on semantic segmentation benchmarks like Cityscapes and delivers substantial INT8 hardware savings, reported as $81.3\%$ to $81.7\%$ area savings and $79.3\%$ to $80.2\%$ power reduction compared to FP32/INT32 alternatives. These results demonstrate a practical, hardware-friendly pathway to accelerate non-linear operators in Transformers using low-bit, integer-only LUT-Approximation.
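
The power-of-two constraint on the scaling factor can be illustrated with a minimal Python sketch. The function names `power_of_two_scale` and `quantize_breakpoint` are hypothetical and not taken from the GQA-LUT code base, and mapping a breakpoint to the INT8 grid as round(p / S) with clamping is an assumption made for illustration, not a confirmed detail of the paper.

```python
import math


def power_of_two_scale(alpha: float) -> float:
    """Snap a positive scale alpha to 2 ** round(log2(alpha)).

    Hypothetical helper: rounding is round-half-up on the exponent.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    exponent = math.floor(math.log2(alpha) + 0.5)
    return 2.0 ** exponent


def quantize_breakpoint(p: float, scale: float, bits: int = 8) -> int:
    """Map a float breakpoint onto the signed integer grid of step `scale`.

    Assumption for illustration: q = clamp(round(p / scale), qmin, qmax).
    """
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = math.floor(p / scale + 0.5)  # round-half-up
    return max(qmin, min(qmax, q))


if __name__ == "__main__":
    s = power_of_two_scale(0.3)        # log2(0.3) is about -1.74, so the exponent is -2
    q = quantize_breakpoint(-0.7, s)   # -0.7 / 0.25 = -2.8, which rounds to -3
    print(s, q, q * s)                 # dequantized breakpoint is q * s
```

In this sketch, a breakpoint at -0.7 is moved to -0.75 once it is forced onto the grid. The coarser the grid (the larger $S$), the larger such a shift can be, which is the kind of breakpoint deviation the rounding mutation is described as alleviating.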

Abstract

Non-linear functions are prevalent in Transformers and their lightweight variants, incurring substantial and frequently underestimated hardware costs. Previous state-of-the-art works optimize these operations by piece-wise linear approximation and store the parameters in look-up tables (LUT), but most of them require hardware-unfriendly high-precision arithmetic such as FP/INT 32 and lack consideration of integer-only INT quantization. This paper proposes a genetic LUT-Approximation algorithm, named GQA-LUT, that can automatically determine the parameters with quantization awareness. The results demonstrate that GQA-LUT achieves negligible degradation on the challenging semantic segmentation task for both vanilla and linear Transformer models. In addition, the proposed GQA-LUT enables the employment of INT8-based LUT-Approximation that achieves an area savings of 81.3~81.7% and a power reduction of 79.3~80.2% compared to the high-precision FP/INT 32 alternatives. Code is available at https://github.com/PingchengDong/GQA-LUT.
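
As a rough illustration of the piece-wise linear LUT idea mentioned in the abstract, the following Python sketch stores one slope and one intercept per segment and selects the segment by comparing the input against the breakpoints. It is a floating-point toy, not the paper's method: the helper names `build_lut` and `pwl_eval` are hypothetical, the segment parameters are simple secant lines, and the uniformly spaced breakpoints stand in for the ones GQA-LUT would search with its genetic algorithm.

```python
import math
from bisect import bisect_right


def gelu(x: float) -> float:
    """Exact GELU, used here as the function to approximate."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def build_lut(fn, breakpoints, lo, hi):
    """Build (breakpoints, slopes, intercepts) from secant lines on [lo, hi].

    N breakpoints give N + 1 segments, i.e. an (N + 1)-entry LUT.
    """
    bps = sorted(breakpoints)
    edges = [lo] + bps + [hi]
    slopes, intercepts = [], []
    for a, b in zip(edges[:-1], edges[1:]):
        k = (fn(b) - fn(a)) / (b - a)   # slope of the secant on [a, b]
        slopes.append(k)
        intercepts.append(fn(a) - k * a)
    return bps, slopes, intercepts


def pwl_eval(x: float, lut) -> float:
    """Pick the segment containing x, then compute k * x + b."""
    bps, slopes, intercepts = lut
    i = bisect_right(bps, x)            # index of the segment that holds x
    return slopes[i] * x + intercepts[i]


if __name__ == "__main__":
    # 7 hand-picked breakpoints -> 8-entry LUT (placeholder values, not searched).
    lut = build_lut(gelu, [-3, -2, -1, 0, 1, 2, 3], -4.0, 4.0)
    for x in (-2.5, -0.5, 0.5, 2.5):
        print(x, gelu(x), pwl_eval(x, lut))
```

Inputs outside `[lo, hi]` are extrapolated with the first or last segment in this sketch. Approximation quality depends strongly on where the breakpoints sit, which is why the breakpoints are the quantity being optimized.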

Paper Structure

This paper contains 15 sections, 3 equations, 3 figures, 6 tables, 2 algorithms.

Figures (3)

  • Figure 1: Taxonomy of LUT-Approximation: (a) FP/INT32 LUT storage pattern, (b) INT8/16 LUT storage pattern with quantization awareness.
  • Figure 2: (a) Comparison of normalized MSE among NN-LUT, GQA-LUT, and GQA-LUT with RM strategy for GELU approximation using an 8-entry LUT, (b) breakpoint quantization analysis of GQA-LUT without RM for EXP under different scaling factors.
  • Figure 3: Comparison of normalized MSE for GELU, HSWISH, and EXP across various INT8 quantization scaling factors $S$ using NN-LUT, GQA-LUT, and the RM strategy with 8/16-entry LUT approximation.