Resource-aware Mixed-precision Quantization for Enhancing Deployability of Transformers for Time-series Forecasting on Embedded FPGAs

Tianheng Ling; Chao Qian; Gregor Schiele

TL;DR

This study makes the VHDL template more flexible by adding a selectable resource type for storing intermediate results across model layers, which relieves the deployment bottleneck by using BRAM efficiently. It also develops a resource-aware mixed-precision quantization approach that lets researchers explore hardware-level quantization strategies without extensive expertise in Neural Architecture Search.

Abstract

This study addresses the challenges of deploying integer-only quantized Transformers on a resource-constrained embedded FPGA (Xilinx Spartan-7 XC7S15). We made our VHDL template more flexible by adding a selectable resource type for storing intermediate results across model layers, which relieves the deployment bottleneck by using BRAM efficiently. We also developed a resource-aware mixed-precision quantization approach that lets researchers explore hardware-level quantization strategies without extensive expertise in Neural Architecture Search. The method estimates resource utilization accurately, with a discrepancy as low as 3% compared to actual deployment metrics. Compared to previous work, our approach enabled the deployment of model configurations using mixed-precision quantization, overcoming the limitations of five configurations that were previously non-deployable with uniform quantization bitwidths. This research therefore improves the applicability of Transformers in embedded systems and supports a broader range of Transformer-powered applications on edge devices.
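The estimate-then-filter idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the layer names, the cost table `KNOWLEDGE_DB`, the `BUDGET` values, and the helper functions are hypothetical and are not taken from the paper or from any device datasheet. The sketch enumerates per-layer bitwidth choices, sums a looked-up resource estimate for each configuration, and keeps only the configurations that fit the budget.

```python
from itertools import product

# Hypothetical per-layer cost table: (layer, bitwidth) -> estimated resources.
# All numbers are illustrative placeholders, not measurements from the paper.
KNOWLEDGE_DB = {
    ("linear", 4): {"lut": 300, "bram": 0.5},
    ("linear", 6): {"lut": 450, "bram": 1.0},
    ("linear", 8): {"lut": 600, "bram": 1.0},
    ("ffn", 4): {"lut": 500, "bram": 1.0},
    ("ffn", 6): {"lut": 800, "bram": 1.5},
    ("ffn", 8): {"lut": 1100, "bram": 2.0},
}

# Hypothetical resource budget (placeholder values, not the XC7S15 limits).
BUDGET = {"lut": 1500, "bram": 2.5}
LAYERS = ("linear", "ffn")
BITWIDTHS = (4, 6, 8)


def estimate(config):
    """Sum the looked-up cost of every layer at its chosen bitwidth."""
    total = {"lut": 0, "bram": 0.0}
    for layer, bits in zip(LAYERS, config):
        for resource, cost in KNOWLEDGE_DB[(layer, bits)].items():
            total[resource] += cost
    return total


def fits(usage, budget=BUDGET):
    """A configuration is kept only if every resource stays within budget."""
    return all(usage[r] <= budget[r] for r in budget)


def deployable_configs():
    """Enumerate all per-layer bitwidth combinations and filter by budget."""
    candidates = product(BITWIDTHS, repeat=len(LAYERS))
    return [c for c in candidates if fits(estimate(c))]


if __name__ == "__main__":
    for config in deployable_configs():
        print(dict(zip(LAYERS, config)), estimate(config))
```

With these placeholder numbers, the uniform 8-bit configuration exceeds the budget while a mixed configuration such as 8-bit linear with 6-bit FFN fits, which mirrors the kind of outcome the abstract describes. A real workflow would still validate the surviving candidates through actual synthesis and accuracy evaluation.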

Paper Structure (24 sections, 7 figures, 7 tables, 1 algorithm)

Introduction
Transformers for Time-series Forecasting
Problem Statement
Proposed Solutions
Adaptive Resource Allocation
Mixed-precision Quantized Transformer
Transition to Mixed-precision Quantization
Resource-aware Mixed-precision Quantization
Phase 1: Preparation
Phase 2: Estimation
Phase 3: Filtering
Phase 4: Validation
Preparation of Knowledge Database
Resource Estimation and Filtering
Experiments and Evaluation
...and 9 more sections

Figures (7)

Figure 1: The Architecture of the Transformer Model
Figure 2: Uniform (left) vs Mixed (right) 8-bit Linear Layer
Figure 3: Uniform (left) vs Mixed (right) 8-bit Addition Operation
Figure 4: Uniform (up) vs Mixed (down) 8-bit FFN Module
Figure 5: Workflow of Resource-aware Mixed-precision Quantization
...and 2 more figures
