ScaleBITS: Scalable Bitwidth Search for Hardware-Aligned Mixed-Precision LLMs

Xinlin Li, Timothy Chou, Josh Fromm, Zichang Liu, Yunjie Pan, Christina Fragouli

TL;DR

ScaleBITS is a mixed-precision quantization framework that automates fine-grained bitwidth allocation under a memory budget while preserving hardware efficiency. It uses a scalable approximation to the greedy algorithm to enable end-to-end principled allocation.

Abstract

Post-training weight quantization is crucial for reducing the memory and inference cost of large language models (LLMs), yet pushing the average precision below 4 bits remains challenging due to highly non-uniform weight sensitivity and the lack of principled precision allocation. Existing solutions either use irregular fine-grained mixed precision, which incurs high runtime overhead, or rely on heuristics or highly constrained precision allocation strategies. In this work, we propose ScaleBITS, a mixed-precision quantization framework that enables automated, fine-grained bitwidth allocation under a memory budget while preserving hardware efficiency. Guided by a new sensitivity analysis, we introduce a hardware-aligned, block-wise weight partitioning scheme, powered by bi-directional channel reordering. We formulate global bitwidth allocation as a constrained optimization problem and develop a scalable approximation to the greedy algorithm, enabling end-to-end principled allocation. Experiments show that ScaleBITS significantly improves over uniform-precision quantization (up to +36%) and outperforms state-of-the-art sensitivity-aware baselines (up to +13%) in the ultra-low-bit regime, without adding runtime overhead.
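
To make the budgeted allocation problem concrete, the following is a minimal sketch of a plain greedy allocator, not the scalable approximation developed in the paper. It assumes each block comes with a precomputed loss estimate per candidate bitwidth. The function name `greedy_bit_allocation`, the `errors` table, and the toy numbers are all hypothetical and chosen only for illustration.

```python
import heapq


def greedy_bit_allocation(errors, sizes, bit_options, budget_bits):
    """Plain greedy bitwidth allocation under a total memory budget.

    errors[i][b]  -- assumed loss estimate for block i at bitwidth b
    sizes[i]      -- number of weights in block i
    bit_options   -- candidate bitwidths in ascending order
    budget_bits   -- total number of bits available for all blocks
    """
    n = len(sizes)
    level = [0] * n  # index into bit_options for each block
    used = sum(size * bit_options[0] for size in sizes)
    if used > budget_bits:
        raise ValueError("budget is below the lowest-precision footprint")

    def next_upgrade(i):
        # Loss reduction per extra bit for moving block i up one level.
        lo, hi = bit_options[level[i]], bit_options[level[i] + 1]
        cost = sizes[i] * (hi - lo)
        return (errors[i][lo] - errors[i][hi]) / cost, cost

    heap = []
    if len(bit_options) > 1:
        for i in range(n):
            gain, _ = next_upgrade(i)
            heapq.heappush(heap, (-gain, i))  # max-heap via negated gain

    while heap:
        _, i = heapq.heappop(heap)
        _, cost = next_upgrade(i)
        if used + cost > budget_bits:
            continue  # this upgrade does not fit; try the remaining blocks
        used += cost
        level[i] += 1
        if level[i] + 1 < len(bit_options):
            gain, _ = next_upgrade(i)
            heapq.heappush(heap, (-gain, i))

    return [bit_options[l] for l in level], used


# Toy example: three equally sized blocks, average budget of 3 bits per weight.
toy_errors = [
    {2: 10.0, 3: 4.0, 4: 3.0},
    {2: 2.0, 3: 1.5, 4: 1.2},
    {2: 8.0, 3: 3.0, 4: 0.5},
]
bits, used = greedy_bit_allocation(toy_errors, [10, 10, 10], [2, 3, 4], 90)
print(bits, used)
```

In this toy setting the insensitive middle block stays at the lowest precision while the freed budget goes to the two sensitive blocks, which is the qualitative behavior a sensitivity-aware allocator aims for.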

Paper Structure (33 sections, 10 equations, 18 figures, 7 tables, 2 algorithms)

Figures (18)

  • Figure 1: ScaleBITS yields a smooth accuracy–compression trade-off beyond the budgets supported by existing methods.
  • Figure 2: Weight sensitivity distribution in Gemma2-9B model.
  • Figure 3: Estimated layer sensitivity using first-order Taylor expansion around a quantized model (top) and the full-precision model (bottom).
  • Figure 4: Overview of the proposed mixed-precision framework. (a) Sensitivity distribution of the original weight matrix. (b) Sensitivity distribution after bi-directional reordering of input and output channels based on sensitivity. (c) Resulting block-wise precision allocation over the reordered weight matrix.
  • Figure 5: The estimated layer sensitivity under (left) uniform-precision and (right) mixed-precision quantization.
  • ...and 13 more figures
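
The reordering and block partitioning described in the Figure 4 caption can be illustrated with a small sketch. It assumes a per-weight sensitivity matrix is already available and scores each channel by its summed sensitivity. That scoring rule, the function name `reorder_and_block`, and the toy matrix are assumptions for illustration; the paper's actual procedure may differ.

```python
def reorder_and_block(sens, block_rows, block_cols):
    """Sort output (row) and input (column) channels by total sensitivity,
    then sum the reordered sensitivities over fixed-size blocks.

    sens is a list of lists: sens[r][c] is the assumed sensitivity of the
    weight connecting input channel c to output channel r.
    """
    n_rows, n_cols = len(sens), len(sens[0])
    if n_rows % block_rows or n_cols % block_cols:
        raise ValueError("matrix shape must be divisible by the block shape")

    # Channel scores: summed sensitivity per row and per column.
    row_score = [sum(row) for row in sens]
    col_score = [sum(sens[r][c] for r in range(n_rows)) for c in range(n_cols)]

    # Descending, stable sort so that sensitive channels cluster together.
    row_perm = sorted(range(n_rows), key=lambda r: -row_score[r])
    col_perm = sorted(range(n_cols), key=lambda c: -col_score[c])
    reordered = [[sens[r][c] for c in col_perm] for r in row_perm]

    # Aggregate sensitivity per block of the reordered matrix.
    blocks = [
        [
            sum(
                reordered[r][c]
                for r in range(i, i + block_rows)
                for c in range(j, j + block_cols)
            )
            for j in range(0, n_cols, block_cols)
        ]
        for i in range(0, n_rows, block_rows)
    ]
    return row_perm, col_perm, blocks


toy_sens = [
    [1, 0, 2, 0],
    [0, 0, 0, 0],
    [3, 0, 9, 1],
    [0, 0, 1, 0],
]
row_perm, col_perm, blocks = reorder_and_block(toy_sens, 2, 2)
print(row_perm, col_perm, blocks)
```

After reordering, most of the sensitivity mass lands in the top-left block, so a block-wise allocator can give that block more bits without resorting to irregular per-weight precision.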