Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models

Yijia Zhang, Lingran Zhao, Shijie Cao, Wenqiang Wang, Ting Cao, Fan Yang, Mao Yang, Shanghang Zhang, Ningyi Xu

TL;DR

The paper tackles the lack of a clear winner between low-bit integer and floating-point quantization for large language models by conducting a thorough, layer-wise comparison at fixed bit-widths and introducing Mixture of Formats Quantization (MoFQ). MoFQ selects the optimal format (INT or FP) per layer based on quantization-error metrics, and is augmented by a NaN/Inf reallocation strategy for FP4 to boost precision. The approach yields state-of-the-art results for both weight-only 4-bit and weight-activation 8-bit post-training quantization on LLaMA and OPT, with substantial speedups over prior methods and no additional hardware overhead. These results demonstrate that simple, hardware-friendly per-layer format selection can dramatically improve low-bit quantization performance for large-scale models, enabling efficient deployment on existing and forthcoming accelerators.

Abstract

Efficient deployment of large language models (LLMs) necessitates low-bit quantization to minimize model size and inference cost. While low-bit integer formats (e.g., INT8/INT4) have been the conventional choice, emerging low-bit floating-point formats (e.g., FP8/FP4) offer a compelling alternative and are gaining support from cutting-edge hardware, such as NVIDIA's H100 GPU. However, it remains unclear whether low-bit INT or FP formats are superior for quantizing LLMs. In this study, we conduct a comparative analysis of INT and FP quantization at the same bit-width, revealing that the optimal quantization format varies across layers due to the complexity and diversity of tensor distributions. Consequently, we advocate the Mixture of Formats Quantization (MoFQ), which selects the optimal format on a layer-wise basis. This simple yet effective approach achieves state-of-the-art results in both weight-only (W-only) and weight-activation (WA) post-training quantization scenarios when tested on LLaMA across various tasks. In 4-bit W-only quantization, MoFQ surpasses GPTQ without complex hyperparameter tuning and quantizes an order of magnitude faster. In 8-bit WA quantization, MoFQ significantly outperforms INT/FP-only methods, achieving performance close to that of the full-precision model. Notably, MoFQ incurs no hardware overhead compared to INT/FP-only quantization, as the bit-width remains unchanged.
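To make the layer-wise selection idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a per-tensor scale, mean squared error as the quantization-error metric, a symmetric INT4 grid of integers in [-7, 7], and an FP4 grid with E2M1 magnitudes; the paper's actual metric, scaling granularity, and FP4 encoding may differ. The names `quantize_to_grid` and `select_format` are hypothetical.

```python
import numpy as np

# Assumption: FP4 uses an E2M1 layout (1 sign, 2 exponent, 1 mantissa bit),
# which gives these representable magnitudes.
FP4_E2M1_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.unique(np.concatenate([-FP4_E2M1_MAGNITUDES, FP4_E2M1_MAGNITUDES]))

# Assumption: symmetric INT4 with integer levels in [-7, 7].
INT4_GRID = np.arange(-7, 8, dtype=np.float64)


def quantize_to_grid(w, grid):
    """Fake-quantize w: scale the grid so its largest magnitude matches
    max|w| (per-tensor scale), then round each value to the nearest level."""
    max_abs = np.max(np.abs(w))
    if max_abs == 0:
        return np.zeros_like(w, dtype=np.float64)
    scale = max_abs / np.max(np.abs(grid))
    scaled_grid = grid * scale
    # Nearest-level lookup via broadcasting: shape (..., num_levels).
    idx = np.abs(w[..., None] - scaled_grid).argmin(axis=-1)
    return scaled_grid[idx]


def select_format(w):
    """Return the format with the lower MSE for this tensor, plus both errors."""
    errors = {
        "INT4": float(np.mean((w - quantize_to_grid(w, INT4_GRID)) ** 2)),
        "FP4": float(np.mean((w - quantize_to_grid(w, FP4_GRID)) ** 2)),
    }
    best = min(errors, key=errors.get)
    return best, errors


# Usage on synthetic "layers" (random data, not real model weights).
rng = np.random.default_rng(0)
layers = {
    "uniform_like": rng.uniform(-1.0, 1.0, size=4096),
    "heavy_tailed": rng.standard_t(df=3, size=4096),
}
for name, w in layers.items():
    fmt, errs = select_format(w)
    print(name, fmt, errs)
```

Because both candidates use the same bit-width, the per-layer choice only changes how the 4-bit codes are interpreted, which is consistent with the abstract's point that the bit-width remains unchanged.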

Paper Structure (18 sections, 7 figures, 4 tables, 1 algorithm)


Figures (7)

  • Figure 1: Structure of FP formats.
  • Figure 2: Value Distribution represented in FP8 and INT8.
  • Figure 3: Area differences of INT and FP operators across various bit-widths (32-bit, 16-bit and 8-bit) with TSMC 7nm technology at 0.5GHz. From left to right: Adder, Multiplier, and MAC unit.
  • Figure 4: Quantizing weight tensors from various layers of LLaMA-65B with 4-bit (INT4 vs. FP4) and 8-bit (INT8 vs. FP8) formats. Neither format is consistently superior at 4-bit, while INT outperforms FP at 8-bit.
  • Figure 5: Using scales obtained from different calibration sets to quantize unseen input activation tensors. FP8 adapts better to the scale value than INT8.
  • ...and 2 more figures