Table of Contents
Fetching ...

FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration

Donghyeon Yi, Seoyoung Lee, Jongho Kim, Junyoung Kim, Sohmyung Ha, Ik Joon Chang, Minkyu Je

TL;DR

RAP is proposed, an AMS-PiM architecture that eliminates DQ-Q processes, introduces FPU- and division-free nonlinear processing, and employs a low-ENOB-ADC-based sparse Matrix Vector multiplication technique, which improves error resiliency, area/energy efficiency, and computational speed while preserving numerical stability.

Abstract

Encoder-based transformers, powered by self-attention layers, have revolutionized machine learning with their context-aware representations. However, their quadratic growth in computational and memory demands presents significant bottlenecks. Analog-Mixed-Signal Process-in-Memory (AMS-PiM) architectures address these challenges by enabling efficient on-chip processing. Traditionally, AMS-PiM relies on Quantization-Aware Training (QAT), which is hardware-efficient but requires extensive retraining to adapt models to AMS-PiMs, making it increasingly impractical for transformer models. Post-Training Quantization (PTQ) mitigates this training overhead but introduces significant hardware inefficiencies. PTQ relies on dequantization-quantization (DQ-Q) processes, floating-point units (FPUs), and high-ENOB (Effective Number of Bits) analog-to-digital converters (ADCs). Particularly, High-ENOB ADCs scale exponentially in area and energy ($2^{ENOB}$), reduce sensing margins, and increase susceptibility to process, voltage, and temperature (PVT) variations, further compounding PTQ's challenges in AMS-PiM systems. To overcome these limitations, we propose RAP, an AMS-PiM architecture that eliminates DQ-Q processes, introduces FPU- and division-free nonlinear processing, and employs a low-ENOB-ADC-based sparse Matrix Vector multiplication technique. Using the proposed techniques, RAP improves error resiliency, area/energy efficiency, and computational speed while preserving numerical stability. Experimental results demonstrate that RAP outperforms state-of-the-art GPUs and conventional PiM architectures in energy efficiency, latency, and accuracy, making it a scalable solution for the efficient deployment of transformers.

FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration

TL;DR

RAP is proposed, an AMS-PiM architecture that eliminates DQ-Q processes, introduces FPU- and division-free nonlinear processing, and employs a low-ENOB-ADC-based sparse Matrix Vector multiplication technique, which improves error resiliency, area/energy efficiency, and computational speed while preserving numerical stability.

Abstract

Encoder-based transformers, powered by self-attention layers, have revolutionized machine learning with their context-aware representations. However, their quadratic growth in computational and memory demands presents significant bottlenecks. Analog-Mixed-Signal Process-in-Memory (AMS-PiM) architectures address these challenges by enabling efficient on-chip processing. Traditionally, AMS-PiM relies on Quantization-Aware Training (QAT), which is hardware-efficient but requires extensive retraining to adapt models to AMS-PiMs, making it increasingly impractical for transformer models. Post-Training Quantization (PTQ) mitigates this training overhead but introduces significant hardware inefficiencies. PTQ relies on dequantization-quantization (DQ-Q) processes, floating-point units (FPUs), and high-ENOB (Effective Number of Bits) analog-to-digital converters (ADCs). Particularly, High-ENOB ADCs scale exponentially in area and energy (), reduce sensing margins, and increase susceptibility to process, voltage, and temperature (PVT) variations, further compounding PTQ's challenges in AMS-PiM systems. To overcome these limitations, we propose RAP, an AMS-PiM architecture that eliminates DQ-Q processes, introduces FPU- and division-free nonlinear processing, and employs a low-ENOB-ADC-based sparse Matrix Vector multiplication technique. Using the proposed techniques, RAP improves error resiliency, area/energy efficiency, and computational speed while preserving numerical stability. Experimental results demonstrate that RAP outperforms state-of-the-art GPUs and conventional PiM architectures in energy efficiency, latency, and accuracy, making it a scalable solution for the efficient deployment of transformers.

Paper Structure

This paper contains 23 sections, 6 equations, 18 figures, 1 table, 1 algorithm.

Figures (18)

  • Figure 1: Limitations of PTQ during inference optimizations. (a) The Dequantization-Quantization (DQ-Q) process and (b) nonlinear layers induce high-ENOB ADCs and FPUs that induce extreme area/energy overhead.
  • Figure 2: Design example for typical PiM devices, where the level-2 (GEMV) and level-3 (GEMM) BLAS operations are processed inherently.
  • Figure 3: Overview of self-attention, with operation type and tensor traffic analysis. Self-attention comprises multiple linear projection layers with two types of GEMV operations, connected by Softmax, dimensional adjustments (transpose and concatenation), and DQ-Q operations. Unlike QAT, the DQ-Q process and nonlinear layers in PTQ create a deadlock between hardware overhead and tensor traffic.
  • Figure 4: We alleviate not only the reliance on high-ENOB ADCs and FPUs but also the need for division arithmetic in our PiM hardware.
  • Figure 5: (a) Even for a lossless quantization, values lose the exponent information. (b) The linear projections are consistent without exponent. (c) Softmax with the values shown at (a) - the absence of exponents deteriorates the computation result of nonlinear layers.
  • ...and 13 more figures