SmartQuant: CXL-based AI Model Store in Support of Runtime Configurable Weight Quantization

Rui Xie, Asad Ul Haq, Linsen Ma, Krystal Sun, Sanchari Sen, Swagath Venkataramani, Liu Liu, Tong Zhang

TL;DR

This work addresses the inefficiency of uniform weight quantization in large generative AI models by exploiting the context-dependent importance of weights during inference. It introduces SmartQuant, a CXL-based AI model store that supports runtime configurable weight quantization through two mechanisms, bit-plane in-memory placement and memory logical space bloating, which enable on-the-fly quantization without protocol changes. Using the OPT transformer as a testbed, the authors show that non-uniform quantization achieves better perplexity than uniform schemes while reducing DRAM access energy by up to 40.3% and model load latency by up to 42.1%. The approach shows practical potential for accelerating transformer inference on future CXL-enabled architectures: memory access cost scales with the chosen weight precision, and the design fits existing hardware support for variable-precision arithmetic.
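The idea behind bit-plane placement can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes 8-bit unsigned weights, and the helper names `to_bit_planes` and `read_top_planes` are hypothetical. Each bit position of every weight is stored as its own plane, so a lower-precision version of the weights can be rebuilt by reading only the most significant planes.

```python
import numpy as np

def to_bit_planes(weights_u8):
    """Split uint8 weights into 8 bit planes, most significant bit first.

    Assumed layout for illustration only: plane i holds bit (7 - i)
    of every weight.
    """
    w = np.asarray(weights_u8, dtype=np.uint8)
    return [((w >> b) & 1).astype(np.uint8) for b in range(7, -1, -1)]

def read_top_planes(planes, k):
    """Rebuild a k-bit approximation using only the k most significant planes."""
    out = np.zeros_like(planes[0], dtype=np.uint8)
    for i in range(k):
        # Put plane i back at its original bit position.
        out = out | np.left_shift(planes[i], 7 - i).astype(np.uint8)
    return out

weights = np.array([0b10110101, 0b00001111, 255, 0], dtype=np.uint8)
planes = to_bit_planes(weights)
full = read_top_planes(planes, 8)   # all planes: exact weights
coarse = read_top_planes(planes, 4) # top 4 planes: low 4 bits dropped
```

Under this assumed layout, fetching `k` of the 8 planes touches `k/8` of the stored bits, which is the sense in which memory traffic can scale with the chosen precision.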

Abstract

Recent studies have revealed that, during inference on generative AI models such as transformers, the importance of different weights varies substantially with context. This suggests that adaptively configuring weight quantization could improve generative AI inference efficiency. Although configurable weight quantization can readily leverage the hardware support for variable-precision arithmetic in modern GPUs and AI accelerators, little prior research has studied how one could exploit variable weight quantization to proportionally improve AI model memory access speed and energy efficiency. Motivated by the rapidly maturing CXL ecosystem, this work develops a CXL-based design solution to fill this gap. The key is to allow CXL memory controllers to play an active role in supporting and exploiting runtime configurable weight quantization. Using the transformer as a representative generative AI model, we carried out experiments that demonstrate the effectiveness of the proposed design solution.

Paper Structure (11 sections, 7 figures)

Figures (7)

  • Figure 1: Illustration of CXL memory devices with built-in quantization format conversion to support low-cost AI inference.
  • Figure 2: Illustration of DRAM structure with bit-plane in-memory placement.
  • Figure 3: Illustration of CXL memory logical space bloating.
  • Figure 4: Measured perplexity (lower is better) under different weight quantization configurations.
  • Figure 5: Average percentage of different quantization formats in (a) the entire model (including all the predictors), (b) attention layers, and (c) MLP layers under different target bits/weight in OPT 30b model.
  • ...and 2 more figures