Panacea: Novel DNN Accelerator using Accuracy-Preserving Asymmetric Quantization and Energy-Saving Bit-Slice Sparsity

Dongyun Kam, Myeongji Yun, Sunwoo Yoo, Seungwoo Hong, Zhengya Zhang, Youngjoo Lee

TL;DR

Panacea targets energy-efficient large-scale DNN inference by enabling asymmetrically quantized bit-slice GEMMs. It introduces AQS-GEMM, which compresses the frequent nonzero high-order (HO) slices that arise from asymmetric activation quantization and skips their computation. Two supporting methods, zero-point manipulation (ZPM) and distribution-based bit-slicing (DBS), increase HO-slice sparsity. The hardware is a tile-based, data-reuse-focused architecture with 16 PEAs and a compensation mechanism that preserves exact results, and it achieves substantial gains over prior accelerators on both transformers and CNNs. The approach improves energy efficiency and throughput for large models, including LLMs, while maintaining accuracy; the authors provide ASIC-level results and an open-source plan. The work shows that combining asymmetric quantization with sparsity-aware bit-slice computation and targeted calibration can substantially reduce memory accesses and energy in DNN inference hardware.

Abstract

Low bit-precisions and their bit-slice sparsity have recently been studied to accelerate general matrix multiplications (GEMMs) during large-scale deep neural network (DNN) inference. While conventional symmetric quantization facilitates low-resolution processing with bit-slice sparsity for both weights and activations, the accuracy loss caused by the asymmetric distributions of activations is unacceptable, especially for large-scale DNNs. To mitigate this accuracy loss, recent studies have actively utilized asymmetric quantization for activations without requiring additional operations. However, cutting-edge asymmetric quantization produces numerous nonzero slices that cannot be compressed or skipped by recent bit-slice GEMM accelerators, so these accelerators consume more processing energy when handling the quantized DNN models. To simultaneously achieve high accuracy and hardware efficiency for large-scale DNN inference, this paper proposes, for the first time, an Asymmetrically-Quantized bit-Slice GEMM (AQS-GEMM). In contrast to previous bit-slice computing, which only skips operations on zero slices, the AQS-GEMM compresses the frequent nonzero slices generated by asymmetric quantization and skips their operations. To increase the slice-level sparsity of activations, we also introduce two algorithm-hardware co-optimization methods: zero-point manipulation and distribution-based bit-slicing. To support the proposed AQS-GEMM and these optimizations at the hardware level, we introduce a new DNN accelerator, Panacea, which efficiently handles the sparse and dense workloads of the tiled AQS-GEMM to increase data reuse and utilization. Panacea supports a specialized dataflow and run-length encoding to maximize data reuse and minimize external memory accesses, significantly improving its hardware efficiency. Our benchmark evaluations show that Panacea outperforms existing DNN accelerators.
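The idea of skipping a frequent nonzero slice can be illustrated with a small Python sketch. This is not the paper's AQS-GEMM: the function names, the 8-bit codes split into two 4-bit slices, and the synthetic activation data are all assumptions chosen for illustration. The sketch quantizes skewed activations asymmetrically, splits each code into a high-order (HO) and a low-order (LO) slice, and then skips every HO slice equal to the most frequent value `r`, relying on the identity sum(h·w) = sum((h − r)·w) + r·sum(w) to keep the result exact.

```python
import random
from collections import Counter


def quantize_asymmetric(xs, num_bits=8):
    """Uniform asymmetric quantization to unsigned codes in [0, 2^b - 1]."""
    qmax = (1 << num_bits) - 1
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / qmax
    zero_point = min(qmax, max(0, round(-lo / scale)))
    q = [min(qmax, max(0, round(x / scale) + zero_point)) for x in xs]
    return q, scale, zero_point


def split_slices(q, lo_bits=4):
    """Split each unsigned code into a high-order and a low-order slice."""
    mask = (1 << lo_bits) - 1
    return [v >> lo_bits for v in q], [v & mask for v in q]


def dot_skipping_frequent_slice(ho, lo, w, lo_bits=4):
    """Dot product of sliced codes with weights, skipping the most frequent
    HO slice value r. Terms with h == r vanish from the first sum; the
    r * sum(w) term compensates for them and depends only on the weights."""
    r = Counter(ho).most_common(1)[0][0]
    w_sum = sum(w)
    ho_acc = sum((h - r) * wi for h, wi in zip(ho, w) if h != r)
    lo_acc = sum(l * wi for l, wi in zip(lo, w))
    skipped = sum(1 for h in ho if h == r)
    return ((ho_acc + r * w_sum) << lo_bits) + lo_acc, skipped


# Synthetic, skewed activations (an assumption, not data from the paper):
# most values sit near zero, with one negative and one large positive outlier.
random.seed(0)
acts = [random.gauss(0.0, 0.1) for _ in range(1022)] + [-0.5, 3.0]
w = [random.randint(-64, 63) for _ in acts]  # hypothetical integer weights

q, scale, zp = quantize_asymmetric(acts)
ho, lo = split_slices(q)
result, skipped = dot_skipping_frequent_slice(ho, lo, w)
reference = sum(qi * wi for qi, wi in zip(q, w))

# Removing the zero-point offset also needs only the weight sum.
corrected = result - zp * sum(w)

print("zero point:", zp)
print("most common HO slices:", Counter(ho).most_common(3))
print("skipped HO slices:", skipped, "of", len(ho))
print("exact match:", result == reference)
```

With this kind of distribution, most codes cluster around the zero point, so the dominant HO slice should be a nonzero value that zero-skipping alone cannot exploit. The compensation term `r * w_sum` involves only the weights, which is why skipping the frequent slice does not change the integer result.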

Paper Structure

This paper contains 13 sections, 5 equations, 20 figures, 1 table.

Figures (20)

  • Figure 1: Accuracy comparison of recent works utilizing symmetric [wu2020easyquant, liu2021post, li2021brecq, banner2018aciq, xiao2023smoothquant] and asymmetric [lin2021fq, liu2023pd, lee2023flexround, cai2020zeroq, li2023repq, wei2022outlier, wei2023outlier, nagel2021white, liu2023qllm, shao2023omniquant] quantization for activations in large-scale DNNs.
  • Figure 2: Examples of uniform quantization methods: 8-bit (a) symmetric and (b) asymmetric approaches.
  • Figure 3: (a) The straightforward bit-slice representation [shomron2020non], and (b) the signed bit-slice representation (SBR) [im2024sibia].
  • Figure 4: An example of the bit-slice GEMM using 7-bit symmetric quantization and the SBR for both weight and activation [im2024sibia].
  • Figure 5: (a) Distributions of asymmetrically quantized activations. (b) Accuracy comparison when using different GEMMs for BERT-base [devlin2018bert] on the GLUE dataset (MNLI) [wang2018glue].
  • ...and 15 more figures