Universality of Layer-Level Entropy-Weighted Quantization Beyond Model Architecture and Size

Alireza Behtash, Marijan Fofonjka, Ethan Baird, Tyler Mauer, Hossein Moghimifam, David Stout, Joel Dennison

TL;DR

EWQ introduces an entropy-driven, architecture- and size-agnostic approach to post-training weight-only quantization for large language models. By computing block-level entropy and applying a threshold-based, mixed-precision strategy, EWQ preserves accuracy (e.g., MMLU within 0.5%) while saving memory (up to ~18–22%), across models from 1.6B to 70B parameters. FastEWQ provides a classifier-based, zero weight-download variant with $O(1)$ decision time and about 80% accuracy, enabling near-instant deployment decisions. Together, EWQ and FastEWQ demonstrate a universal, deployment-friendly path to efficient LLM inference on resource-constrained hardware, with potential perplexity benefits from quantization regularization and broad applicability beyond a single architecture.

Abstract

We present a novel approach to selective model quantization that transcends the limitations of architecture-specific and size-dependent compression methods for Large Language Models (LLMs) using Entropy-Weighted Quantization (EWQ). By analyzing the entropy distribution across transformer blocks, EWQ determines which blocks can be safely quantized without causing significant performance degradation, independent of model architecture or size. Our method outperforms uniform quantization approaches, maintaining Massive Multitask Language Understanding (MMLU) accuracy scores within 0.5% of unquantized models while reducing memory usage by up to 18%. We demonstrate the effectiveness of EWQ across multiple architectures, from 1.6B to 70B parameters, and showcase consistent improvements in the quality-compression trade-off regardless of model scale or architectural design. A surprising finding of EWQ is its ability to reduce perplexity compared to unquantized models, suggesting the presence of beneficial regularization through selective precision reduction. This improvement holds across different model families, indicating a fundamental relationship between layer-level entropy and optimal precision requirements. Additionally, we introduce FastEWQ, a rapid method for entropy distribution analysis that eliminates the need for loading model weights. This technique leverages universal characteristics of entropy distribution that persist across various architectures and scales, enabling near-instantaneous quantization decisions while maintaining 80% classification accuracy relative to full entropy analysis. Our results demonstrate that effective quantization strategies can be developed independently of specific architectural choices or model sizes, opening new possibilities for efficient LLM deployment.
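The block-selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the entropy definition (Shannon entropy of a softmax over a block's flattened weights), the threshold rule (mean minus `factor` times the standard deviation), the three precision labels, and the function names `softmax_entropy` and `plan_precision` are all assumptions made here for illustration.

```python
import math
from statistics import mean, pstdev


def softmax_entropy(weights):
    """Shannon entropy (in nats) of a softmax over flattened weights.

    Assumption: this is one plausible way to turn raw weights into a
    probability distribution; the paper's exact definition may differ.
    """
    peak = max(weights)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    # Skip terms that underflowed to zero; they contribute nothing.
    return -sum((e / total) * math.log(e / total) for e in exps if e > 0)


def plan_precision(block_entropies, factor=1.0):
    """Assign a precision label to each block from its entropy.

    Hypothetical rule: blocks at or below (mean - factor * std) get the
    most aggressive quantization, blocks at or below the mean get 8-bit,
    and the remaining blocks stay unquantized ("raw").
    """
    mu = mean(block_entropies)
    sigma = pstdev(block_entropies)
    threshold = mu - factor * sigma
    plan = []
    for entropy in block_entropies:
        if entropy <= threshold:
            plan.append("4bit")
        elif entropy <= mu:
            plan.append("8bit")
        else:
            plan.append("raw")
    return plan


if __name__ == "__main__":
    # Toy "blocks": flat weights give high entropy, peaked weights give low entropy.
    toy_blocks = [
        [0.0, 0.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [5.0, 0.0, 0.0, 0.0],
    ]
    entropies = [softmax_entropy(block) for block in toy_blocks]
    for index, (entropy, label) in enumerate(zip(entropies, plan_precision(entropies))):
        print(f"block {index}: entropy={entropy:.3f} -> {label}")
```

In this sketch, lower-entropy blocks are quantized first, matching the ordering the abstract describes. The `factor` parameter is a hypothetical knob for trading memory savings against accuracy.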

Paper Structure

This paper contains 39 sections, 19 equations, 7 figures, 14 tables, 2 algorithms.

Figures (7)

  • Figure 1: Entropy distribution of Meta-Llama-3.1-8B-Instruct model weights with block number. The optimal quantization requires those blocks with lower entropy to be quantized first.
  • Figure 2: Diagrams showing the distribution of features for the number of blocks (num_blocks), execution index (exec_index), number of parameters (num_parameters), and quantization level.
  • Figure 3: Correlation matrix for features num_blocks, exec_index, num_parameters, and quantization level (quantized or not).
  • Figure 4: Pie chart showing the distribution of quantization types in the dataset. The distribution consists of 407 raw blocks, 232 8-bit blocks, and 61 4-bit blocks.
  • Figure 5: Bar plot illustrating the feature importance scores from the random forest classifier trained on the model dataset. The plot highlights the relative contribution of each feature (num_parameters, exec_index, and num_blocks) in determining whether to classify transformer blocks for quantization.
  • ...and 2 more figures