Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

Janghwan Lee, Minsoo Kim, Seungcheol Baek, Seok Joong Hwang, Wonyong Sung, Jungwook Choi

TL;DR

The paper makes large language models more computationally efficient by extending post-training quantization (PTQ) to joint 4-bit weight and 8-bit activation (W4A8) quantization. It introduces three techniques: Activation-Quantization-Aware Scaling (AQAS), which balances weight and activation quantization; Sequence-Length-Aware Calibration (SLAC), which aligns the calibration sequence length with that of the target task; and dINT, a hybrid format combining integer and denormal representations, which mitigates underflow. In experiments on OPT and LLaMA models across language modeling, zero-shot reasoning, and in-context learning, the approach reaches accuracy close to that of full-precision models, and dINT-compatible MAC units deliver about $2\times$ the hardware efficiency of an 8-bit integer MAC unit. Combining PTQ techniques with a hardware-aware number format makes the approach practical for deploying large LLMs in resource-constrained settings.

Abstract

Large Language Models (LLMs) are proficient in natural language processing tasks, but their deployment is often restricted by extensive parameter sizes and computational demands. This paper focuses on post-training quantization (PTQ) in LLMs, specifically 4-bit weight and 8-bit activation (W4A8) quantization, to enhance computational efficiency -- a topic less explored compared to weight-only quantization. We present two innovative techniques: activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC) to enhance PTQ by considering the combined effects on weights and activations and aligning calibration sequence lengths to target tasks. Moreover, we introduce dINT, a hybrid data format combining integer and denormal representations, to address the underflow issue in W4A8 quantization, where small values are rounded to zero. Through rigorous evaluations of LLMs, including OPT and LLaMA, we demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models. By developing arithmetic units compatible with dINT, we further confirm that our methods yield a 2$\times$ hardware efficiency improvement compared to an 8-bit integer MAC unit.
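The underflow issue mentioned in the abstract, where small values are rounded to zero, can be illustrated with a minimal sketch. It does not implement dINT; it only applies a plain symmetric INT4 grid and splits the squared quantization error into an underflow part (nonzero values that become zero) and a rounding part (everything else). The function names, the per-tensor max-based scale, the code range of -7 to 7, and the sample values are all assumptions chosen for illustration, not details taken from the paper.

```python
def quantize_int4(values):
    """Symmetric INT4 quantization with a per-tensor scale (illustrative assumption)."""
    qmax = 7  # assumed symmetric signed 4-bit code range: -7..7
    scale = max(abs(v) for v in values) / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def error_breakdown(values):
    """Split squared quantization error into underflow and rounding parts."""
    codes, scale = quantize_int4(values)
    underflow = 0.0  # error from nonzero values that quantize to code 0
    rounding = 0.0   # error from values that keep a nonzero code
    for v, q in zip(values, codes):
        err = (v - q * scale) ** 2
        if q == 0 and v != 0.0:
            underflow += err
        else:
            rounding += err
    return codes, underflow, rounding


# Hypothetical tensor: one large value sets the scale, several small values follow.
sample = [7.0, -3.2, 0.4, -0.3, 0.2, 0.1]
codes, underflow, rounding = error_breakdown(sample)
print("codes:", codes)
print("underflow error:", underflow)
print("rounding error:", rounding)
```

In this toy tensor the four small entries all collapse to code 0, so the underflow part of the error is larger than the rounding part. That is the kind of effect a format with extra representable values near zero is meant to reduce.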

Paper Structure (27 sections, 5 equations, 6 figures, 14 tables)

Figures (6)

  • Figure 1: (a) Illustration of fused-layernorm (fused-LN) in OPT (top) and layernorm (LN) in LLaMA (bottom) computation patterns within a Transformer layer. Note that the two computation patterns yield the same output if computed in full precision, but they deviate when activations and weights are quantized. (b) Min-Max range of input activations (left) and weights (right) as operands of matrix multiplication. (c) Min-Max range of input activations as the sequence length varies from 128 to 2048 (Orange: OPT-6.7B, Blue: LLaMA-7B). (d) Max values of per-channel input activations for OPT-6.7B (left) and LLaMA-7B (right) for different input sequence lengths (32 and 2048).
  • Figure 2: Absolute max value of (a) input activation and (b) weight after scaling by each method (OPT-6.7B). We observed that these trends were significantly pronounced in OPT models due to large outliers. (See Fig. \ref{fig-scale-llama} for the same plot for LLaMA.)
  • Figure 3: (a) Comparison of weight update ratio in Eq. \ref{eq: delta} in OPT-6.7B, LLaMA-7B, and LLaMA-7B with AQAS scaling. (b) Minimum input activation range for the query layer in three models: W4A8 (calibrated with 128 and 2048 sequence lengths) and full-precision (FP), all evaluated under an input sequence length of 128.
  • Figure 4: (a) INT4 without rounding sets small values near zero to zero while preserving the rest, which causes performance degradation. INT4 without underflow preserves only values near zero, which improves performance. (b) Impact of underflow error and rounding error on the output error. In INT4, underflow error has a significant impact on the output error. (c) Proposed dINT4 preserves two small values near zero, preventing performance degradation. (d) Using the proposed dINT4 to reduce underflow error leads to a significant reduction in output error.
  • Figure 5: (Blue) Values to be quantized. (Orange) INT4 quantized values, evenly spaced. (Green) FP4 quantized values, dense resolution for small values but coarse resolution for large magnitudes. (Red) Proposed dINT4 format, balanced quantization range with a separate special value for small values.
  • ...and 1 more figure