MixPE: Quantization and Hardware Co-design for Efficient LLM Inference

Yu Zhang, Mingzi Wang, Lancheng Zou, Wulong Liu, Hui-Ling Zhen, Mingxuan Yuan, Bei Yu

TL;DR

MixPE is a specialized mixed-precision processing element designed for efficient low-bit quantization in LLM inference; it surpasses state-of-the-art quantization accelerators with a 2.6× speedup and a 1.4× energy reduction.

Abstract

Transformer-based large language models (LLMs) have achieved remarkable success as model sizes continue to grow, yet their deployment remains challenging due to significant computational and memory demands. Quantization has emerged as a promising solution, and state-of-the-art quantization algorithms for LLMs introduce the need for mixed-precision matrix multiplication (mpGEMM), where lower-precision weights are multiplied with higher-precision activations. Despite the benefits of mpGEMM, current hardware accelerators such as GPUs and TPUs lack native support for it, leading to inefficient dequantization operations in the main sequential loop. To address this limitation, we introduce MixPE, a specialized mixed-precision processing element designed for efficient low-bit quantization in LLM inference. MixPE leverages two key innovations to minimize dequantization overhead and unlock the full potential of low-bit quantization. First, recognizing that the scale and zero point are shared within each quantization group, we propose performing dequantization after per-group mpGEMM, significantly reducing dequantization overhead. Second, instead of relying on conventional multipliers, MixPE utilizes efficient shift-and-add operations for multiplication, optimizing both computation and energy efficiency. Our experimental results demonstrate that MixPE surpasses state-of-the-art quantization accelerators with a $2.6\times$ speedup and a $1.4\times$ energy reduction.
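The two ideas in the abstract can be illustrated in software. The sketch below is not the authors' hardware design; it is a minimal NumPy model under stated assumptions: weights are unsigned 4-bit codes `q` with a per-group float `scale` and integer `zero` point (so the real weight is `scale * (q - zero)`), activations are INT8, and all function names are hypothetical. Because `scale` and `zero` are constant within a group, the group's dot product can be computed entirely in integers, with shifts and adds standing in for the multiplier, and dequantized once per group instead of once per weight.

```python
import numpy as np


def shift_add_mul(a: int, q: int) -> int:
    """Multiply integer a by an unsigned 4-bit code q using only shifts and adds."""
    assert 0 <= q < 16, "q is assumed to be an unsigned 4-bit code"
    acc = 0
    for bit in range(4):
        if (q >> bit) & 1:      # this bit of the weight is set
            acc += a << bit     # add the activation shifted by the bit position
    return acc


def dot_dequant_first(a, q, scale, zero, group_size):
    """Baseline: dequantize every weight to float, then multiply-accumulate."""
    total = 0.0
    for g in range(len(q) // group_size):
        sl = slice(g * group_size, (g + 1) * group_size)
        w = scale[g] * (q[sl].astype(np.float64) - zero[g])  # per-weight dequant
        total += float(np.dot(a[sl].astype(np.float64), w))
    return total


def dot_dequant_after(a, q, scale, zero, group_size):
    """Integer dot product per group, then a single dequantization per group."""
    total = 0.0
    for g in range(len(q) // group_size):
        sl = slice(g * group_size, (g + 1) * group_size)
        # Integer-only accumulation of a_i * q_i via shift-and-add.
        int_dot = sum(shift_add_mul(int(ai), int(qi)) for ai, qi in zip(a[sl], q[sl]))
        # The zero point factors out: sum a_i * (q_i - z) = sum a_i * q_i - z * sum a_i.
        a_sum = int(a[sl].astype(np.int64).sum())
        total += float(scale[g]) * (int_dot - int(zero[g]) * a_sum)
    return total


# Illustrative data (all values are made up for the example).
rng = np.random.default_rng(0)
group_size, n_groups = 128, 2
a = rng.integers(-128, 128, size=group_size * n_groups).astype(np.int8)
q = rng.integers(0, 16, size=group_size * n_groups).astype(np.uint8)
scale = rng.uniform(0.01, 0.1, size=n_groups)
zero = rng.integers(0, 16, size=n_groups)

baseline = dot_dequant_first(a, q, scale, zero, group_size)
deferred = dot_dequant_after(a, q, scale, zero, group_size)
print(abs(baseline - deferred) < 1e-6)
```

In this model the deferred version performs one floating-point scale per group rather than one per weight, which is the source of the reduced dequantization overhead the abstract describes; the actual savings in MixPE depend on the hardware details given in the paper.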

Paper Structure

This paper contains 18 sections, 8 equations, 9 figures.

Figures (9)

  • Figure 1: (Left) The dequantization overhead of Llama-2-7B quantized in W4A8. (Right) MixPE achieves over $4\times$ speedup when running LLMs compared to INT8-based TPUs.
  • Figure 2: The mpGEMM operations in decoder-only LLMs.
  • Figure 3: (Left) Quantized GEMM on GPUs. The low-precision weights are first dequantized to high precision (Step ①). Each group then performs multiplication using conventional high-precision units (Step ②), and the results are accumulated to produce the final output (Step ③). (Right) Quantized mpGEMM on MixPE. The mixed-precision multiplication of each group is first computed by MixPE with efficient low-bit support (Step ①). The results are then dequantized and subsequently accumulated to produce the final output (Steps ② and ③).
  • Figure 4: (Left) Traditional multiplier-based PE design. (Right) INT4$\times$INT8 processing element in our MixPE. MixPE achieves efficient computation by directly utilizing hardware-friendly shift operations and an optimized adder tree, eliminating the need to transform INT4 to INT8, leading to reduced power consumption and improved throughput.
  • Figure 5: Architecture overview for integrating MixPE into the output stationary systolic array.
  • ...and 4 more figures