On the Compressibility of Quantized Large Language Models

Yu Mao, Weilan Wang, Hongchao Du, Nan Guan, Chun Jason Xue

TL;DR

This work investigates the compressibility of quantized large language models (LLMs) and how entropy coding interacts with model accuracy. It analyzes the trade-off between information-theoretic entropy and compression, arguing that higher entropy in the quantized values can improve accuracy but reduces compressibility under a uniform-information assumption, and it explores outlier-aware quantization to preserve critical information. Empirically, entropy coding yields substantial additional compression on top of INT8 quantization: for OPT models, about 2× for weights and 4× for activations beyond the 4× from INT8, or roughly 8× and 16× overall in some setups. Techniques like SmoothQuant maintain accuracy comparable to channel-wise quantization. Entropy coding can also reduce model loading time significantly (40–60% in the tested scenarios), which suggests a viable path for deploying quantized LLMs on memory-constrained devices with less data movement and lower latency.

Abstract

Deploying Large Language Models (LLMs) on edge or mobile devices offers significant benefits, such as enhanced data privacy and real-time processing capabilities. However, it also faces critical challenges due to the substantial memory requirements of LLMs. Quantization is an effective way of reducing the model size while maintaining good performance. Even after quantization, however, LLMs may still be too big to fit entirely into the limited memory of edge or mobile devices and must be partially loaded from storage to complete inference. In this case, the I/O latency of model loading becomes the bottleneck of LLM inference latency. In this work, we take a preliminary step toward applying data compression techniques to reduce data movement and thus speed up the inference of quantized LLMs on memory-constrained devices. In particular, we discuss the compressibility of quantized LLMs, the trade-off between their compressibility and performance, and opportunities to optimize both jointly.
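
The two-step idea in the abstract (quantize, then compress losslessly before storing) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the weights are synthetic Gaussian samples with one hypothetical outlier, the helper names `quantize_int8` and `pack_int8` are made up for this example, and `zlib` (DEFLATE, which includes Huffman coding) stands in for whatever entropy coder the paper uses.

```python
import random
import zlib

def quantize_int8(values):
    """Symmetric tensor-wise INT8 quantization: one scale shared by all values."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def pack_int8(q):
    """Store each signed INT8 value as one byte (two's complement)."""
    return bytes(v & 0xFF for v in q)

# Synthetic stand-in for a weight tensor (assumption: NOT real LLM weights).
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(50_000)]
weights[0] = 0.3  # hypothetical outlier that stretches the quantization range

q, scale = quantize_int8(weights)
raw = pack_int8(q)                    # what would be stored after INT8 quantization
compressed = zlib.compress(raw, 9)    # lossless step applied on top of quantization
restored = zlib.decompress(compressed)

ratio = len(raw) / len(compressed)
print(f"INT8 bytes: {len(raw)}, compressed bytes: {len(compressed)}, ratio: {ratio:.2f}x")
```

The lossless step never changes the quantized values, so accuracy is unaffected by it; only the number of bytes read from storage changes. The ratio printed here depends entirely on the synthetic data and says nothing about the ratios reported in the paper.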

Paper Structure

This paper contains 13 sections, 3 theorems, 3 equations, 5 figures, and 5 tables.

Key Result

Theorem 3.1

The entropy of data increases with the uniformity of its distribution, reaching a maximum when the distribution is completely uniform (Shannon, 1948).
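
A small numerical check can make this statement concrete. The sketch below computes the Shannon entropy H = -Σ p·log2(p) for two distributions over the 256 levels of an 8-bit code; the `peaked` distribution is a hypothetical example chosen for illustration and does not come from the paper.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability symbols contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 256  # number of distinct 8-bit symbols

# Completely uniform: every symbol equally likely.
uniform = [1.0 / n] * n

# Peaked (hypothetical): half of the mass on one symbol, the rest spread evenly.
peaked = [0.5] + [0.5 / (n - 1)] * (n - 1)

h_uniform = shannon_entropy(uniform)
h_peaked = shannon_entropy(peaked)

print(f"uniform: {h_uniform:.3f} bits/symbol")
print(f"peaked:  {h_peaked:.3f} bits/symbol")
```

Because entropy is a lower bound on the average number of bits per symbol that a lossless coder can achieve, the uniform case leaves nothing for entropy coding to remove from an 8-bit representation, while the peaked case does.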

Figures (5)

  • Figure 1: A complete lossy compression process involves two steps: quantization and entropy coding. The latter has long been overlooked in current model compression research. For OPT models, INT8 quantization achieves a 4x compression ratio, while entropy coding achieves 2x for weights and 4x for activations.
  • Figure 2: The relationship between quantized weights / activations, entropy, accuracy, and compressibility. Achieving both high accuracy and high compressibility is contradictory in theory.
  • Figure 3: Weight distributions of the original data matrix and of the tensor-wise and channel-wise quantized matrices.
  • Figure 4: Compression ratios of tensor-wise and channel-wise quantized models.
  • Figure 5: Accuracy on four downstream tasks of W8A8 (Tensor-wise), SmoothQuant (Tensor-wise), and LLM.int8() (Channel-wise).
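
The contrast between tensor-wise and channel-wise quantization that Figures 2 to 4 refer to can be illustrated with a small experiment. This is a sketch under assumptions, not a reproduction of the paper's results: the matrix is synthetic, one row is given a much larger spread to act as a hypothetical outlier channel, and the helpers `quantize` and `empirical_entropy` are names introduced here.

```python
import math
import random
from collections import Counter

def quantize(values, scale):
    """Symmetric INT8 quantization of a list of floats with a given scale."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def empirical_entropy(symbols):
    """Entropy (bits/symbol) of the observed symbol frequencies."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Synthetic matrix: 64 channels (rows) x 512 values; channel 0 is an outlier channel.
rng = random.Random(0)
stds = [1.0] + [0.02] * 63
matrix = [[rng.gauss(0.0, s) for _ in range(512)] for s in stds]

# Tensor-wise: a single scale for the whole matrix, set by the largest magnitude.
tensor_scale = max(abs(v) for row in matrix for v in row) / 127.0
tensor_q = [x for row in matrix for x in quantize(row, tensor_scale)]

# Channel-wise: each row gets its own scale, so each row uses the full INT8 range.
channel_q = []
for row in matrix:
    row_scale = max(abs(v) for v in row) / 127.0
    channel_q.extend(quantize(row, row_scale))

h_tensor = empirical_entropy(tensor_q)
h_channel = empirical_entropy(channel_q)

# 8 / H is the ideal entropy-coding ratio relative to storing 8 bits per value.
print(f"tensor-wise:  {h_tensor:.2f} bits/value, ideal ratio {8 / h_tensor:.2f}x")
print(f"channel-wise: {h_channel:.2f} bits/value, ideal ratio {8 / h_channel:.2f}x")
```

On this synthetic data, the shared scale squeezes most non-outlier values into a few levels near zero, which lowers entropy and raises compressibility but discards detail. Per-channel scales spread values across more levels, which preserves more information at the cost of compressibility.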

Theorems & Definitions (5)

  • Definition 3.1
  • Theorem 3.1
  • Theorem 3.2
  • Conjecture 3.1
  • Corollary 3.2.1