QUARK: Quantization-Enabled Circuit Sharing for Transformer Acceleration by Exploiting Common Patterns in Nonlinear Operations

Zhixiong Zhao, Haomin Li, Fangxin Liu, Yuncheng Lu, Zongwu Wang, Tao Yang, Li Jiang, Haibing Guan

TL;DR

QUARK addresses a key bottleneck in Transformer inference: nonlinear operators such as Softmax, GELU, and LayerNorm. It introduces integer-only, hardware-friendly approximations and a reorder-based group quantization scheme so that these operators can share circuits on FPGA. The main contributions are a sub-operator-sharing framework, offline channel reordering fused into weights, and a three-stage Group Quantization Unit that adapts to per-layer distributions under a block-ops budget. Empirically, QUARK delivers up to 1.96× end-to-end speedup over GPU, reduces nonlinear hardware overhead by more than 50%, and maintains or even improves accuracy under ultra-low-bit quantization for CV and NLP Transformers.
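To make "offline channel reordering fused into weights" concrete, here is a minimal sketch of the general idea, not QUARK's actual procedure. It assumes a producer layer, an elementwise operator, and a consumer layer. Permuting the producer's output rows and the consumer's input columns by the same permutation leaves the result unchanged, so the reordering costs nothing at inference time. The names `reorder_permutation`, `fuse_reorder`, and `channel_ranges`, and the ReLU-like stand-in for the nonlinear operator, are illustrative assumptions.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def elementwise(v):
    """Stand-in for an elementwise nonlinear operator (assumption: ReLU-like)."""
    return [max(t, 0) for t in v]


def forward(W1, b1, W2, x):
    """Producer layer -> elementwise operator -> consumer layer."""
    h = [a + b for a, b in zip(matvec(W1, x), b1)]
    return matvec(W2, elementwise(h))


def reorder_permutation(channel_ranges):
    """Sort channel indices by dynamic range so similar channels are adjacent."""
    return sorted(range(len(channel_ranges)), key=lambda c: channel_ranges[c])


def fuse_reorder(W1, b1, W2, perm):
    """Fold the permutation into the weights offline.

    Producer output rows (and bias) and consumer input columns are permuted
    identically, so the network output is unchanged.
    """
    W1_r = [W1[p] for p in perm]
    b1_r = [b1[p] for p in perm]
    W2_r = [[row[p] for p in perm] for row in W2]
    return W1_r, b1_r, W2_r


# Toy layer: 2 inputs, 3 hidden channels, 2 outputs (integer values, so exact).
W1 = [[1, -2], [3, 0], [-1, 4]]
b1 = [0, 1, -1]
W2 = [[2, 1, -1], [0, -3, 5]]
x = [3, 2]

# Hypothetical per-channel ranges, e.g. measured on calibration data.
channel_ranges = [8.0, 0.5, 3.0]

perm = reorder_permutation(channel_ranges)
W1_r, b1_r, W2_r = fuse_reorder(W1, b1, W2, perm)

y_ref = forward(W1, b1, W2, x)
y_reordered = forward(W1_r, b1_r, W2_r, x)
print(perm, y_ref, y_reordered)
```

After reordering, channels with similar ranges are contiguous, so a group quantizer can assign one scale per contiguous block of channels. How QUARK forms groups and chooses scales is not specified in this summary.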

Abstract

Transformer-based models have revolutionized computer vision (CV) and natural language processing (NLP) by achieving state-of-the-art performance across a range of benchmarks. However, nonlinear operations in these models contribute significantly to inference latency, presenting unique challenges for efficient hardware acceleration. To this end, we propose QUARK, a quantization-enabled FPGA acceleration framework that leverages common patterns in nonlinear operations to enable efficient circuit sharing, thereby reducing hardware resource requirements. QUARK targets all nonlinear operations within Transformer-based models, achieving high-performance approximation through a novel circuit-sharing design tailored to accelerate these operations. Our evaluation demonstrates that QUARK significantly reduces the computational overhead of nonlinear operators in mainstream Transformer architectures, achieving up to 1.96× end-to-end speedup over GPU implementations. Moreover, QUARK lowers the hardware overhead of nonlinear modules by more than 50% compared to prior approaches while maintaining high model accuracy, and even substantially boosts accuracy under ultra-low-bit quantization.

Paper Structure

This paper contains 16 sections, 27 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Latency Breakdown of ViT-Tiny on ZCU102.
  • Figure 2: Overview of QUARK Software.
  • Figure 3: Characteristics of GELU Operator in Transformers.
  • Figure 4: Visualization of post-nonlinear activations from the 6th layer in DeiT-T, illustrating key Transformer quantization challenges: LayerNorm's inter-channel variance, Softmax's zero-collapsing heavy-tailed distribution, and GELU's asymmetric activation range that challenges conventional symmetric quantization schemes.
  • Figure 5: The shared hardware components and reuse pathways among Softmax, GELU, and LayerNorm.
  • ...and 3 more figures