PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression

Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, Peter Richtarik

TL;DR

The paper tackles extreme LLM compression (1-2 bits per parameter) by questioning STE-based fine-tuning and introducing PV-Tuning, a representation-agnostic framework that optimizes both the discrete weight assignments and the continuous parameters by alternating P and V steps. It provides convergence guarantees in restricted cases and introduces a linearized V step with subspace descent, which makes large-step updates on quantized weights practical. Empirically, PV-Tuning achieves state-of-the-art compression-accuracy trade-offs for 1-2 bit representations and Pareto-optimal 2-bit quantization for Llama 2 models across multiple sizes, while remaining compatible with existing inference kernels. The work reports substantial improvements over prior PTQ-plus-fine-tuning approaches, highlights the importance of subspace updates, and outlines future extensions to other quantization regimes and to activation quantization.

Abstract

There has been significant interest in "extreme" compression of large language models (LLMs), i.e., to 1-2 bits per parameter, which allows such models to be executed efficiently on resource-constrained devices. Existing work focused on improved one-shot quantization techniques and weight representations; yet, purely post-training approaches are reaching diminishing returns in terms of the accuracy-vs-bit-width trade-off. State-of-the-art quantization methods such as QuIP# and AQLM include fine-tuning (part of) the compressed parameters over a limited amount of calibration data; however, such fine-tuning techniques over compressed weights often make exclusive use of straight-through estimators (STE), whose performance is not well-understood in this setting. In this work, we question the use of STE for extreme LLM compression, showing that it can be sub-optimal, and perform a systematic study of quantization-aware fine-tuning strategies for LLMs. We propose PV-Tuning - a representation-agnostic framework that generalizes and improves upon existing fine-tuning strategies, and provides convergence guarantees in restricted cases. On the practical side, when used for 1-2 bit vector quantization, PV-Tuning outperforms prior techniques for highly-performant models such as Llama and Mistral. Using PV-Tuning, we achieve the first Pareto-optimal quantization for Llama 2 family models at 2 bits per parameter.
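
For readers unfamiliar with the straight-through estimator that the abstract questions, the following is a minimal sketch of generic STE fine-tuning on a toy least-squares problem. It is not the paper's setup: the scalar `grid`, the helper names `quantize` and `ste_finetune`, and the learning rate are all illustrative assumptions. The loss is evaluated at the quantized weights, but its gradient is applied to a latent full-precision copy as if quantization were the identity map.

```python
import numpy as np

def quantize(w, grid):
    """Round every weight to its nearest value in a fixed scalar grid."""
    idx = np.argmin(np.abs(w[:, None] - grid[None, :]), axis=1)
    return grid[idx]

def ste_finetune(X, t, w, grid, lr=0.05, steps=100):
    """Toy least-squares fine-tuning with a straight-through estimator.

    Hypothetical helper for illustration only: the forward pass uses
    quantize(w), the backward pass treats d quantize(w) / d w as identity.
    """
    w = w.copy()  # latent full-precision weights
    losses = []
    for _ in range(steps):
        q = quantize(w, grid)            # forward pass sees quantized weights
        resid = X @ q - t
        losses.append(0.5 * float(np.mean(resid ** 2)))
        grad_q = X.T @ resid / len(t)    # exact gradient w.r.t. q
        w -= lr * grad_q                 # STE: apply it to the latent w
    return quantize(w, grid), losses

# Usage on random data (all values here are made up for the example).
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))
t = X @ rng.normal(size=8)
grid = np.array([-1.0, -0.33, 0.33, 1.0])
q, losses = ste_finetune(X, t, rng.normal(size=8), grid, steps=20)
```

Nothing in this loop guarantees that the recorded loss decreases from one step to the next, since the update direction is computed for a different point than the one being moved.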

Paper Structure (46 sections, 4 theorems, 42 equations, 7 figures, 10 tables, 8 algorithms)

Key Result

Theorem 3.1

Assume $\phi$ is bounded below, and let $x^0\in \mathbb{R}^d_c$. Then (i) $y^k\in \mathbb{R}^d_{\leq c}$ and $x^k \in \mathbb{R}^d_{\leq c}$ for all $k\geq 0$; (ii) $\phi(x^{k+1}) \leq \phi(y^k) \leq \phi(x^k)$ for all $k\geq 0$; and (iii) the sequence $\{\phi(x^{k})\}_{k\geq 0}$ converges.
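
The monotonicity chain in (ii) can be illustrated with a toy alternation. The sketch below rests on several assumptions not stated in this summary: it reads $\mathbb{R}^d_{\leq c}$ as the set of vectors with at most $c$ distinct values, it takes the P step to re-optimize the shared values for a fixed grouping and the V step to reassign entries among the current values, and it uses the separable objective $\phi(x) = \tfrac{1}{2}\|x - w\|^2$, for which both steps have closed forms (the alternation then coincides with 1-D Lloyd/k-means iterations). The function names are hypothetical.

```python
import numpy as np

def phi(x, w):
    """Separable toy objective: 0.5 * ||x - w||^2."""
    return 0.5 * float(np.sum((x - w) ** 2))

def p_step(x, w):
    """Keep the grouping of x fixed; re-optimize each group's shared value."""
    y = np.empty_like(x)
    for v in np.unique(x):
        mask = x == v
        y[mask] = w[mask].mean()  # exact minimizer of phi over this group
    return y

def v_step(y, w):
    """Keep the value set of y fixed; move each entry to its best value."""
    values = np.unique(y)
    nearest = np.argmin(np.abs(w[:, None] - values[None, :]), axis=1)
    return values[nearest]

def pv_toy(w, c=4, iters=20, seed=0):
    """Alternate P and V steps, recording phi after every half-step."""
    rng = np.random.default_rng(seed)
    # Starting point with at most c distinct values: snap w to c of its entries.
    x = v_step(rng.choice(w, size=c, replace=False), w)
    history = [phi(x, w)]
    for _ in range(iters):
        y = p_step(x, w)   # phi(y) <= phi(x)
        x = v_step(y, w)   # phi(x_new) <= phi(y)
        history.extend([phi(y, w), phi(x, w)])
    return x, history

w = np.random.default_rng(1).normal(size=64)
x, history = pv_toy(w, c=4, iters=10)
```

In this sketch `history` is non-increasing up to floating-point rounding and `x` never has more than `c` distinct values, mirroring (i) and (ii) for this special objective only. For a general $\phi$ neither step has a closed form.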

Figures (7)

  • Figure 1: WikiText-2 perplexity (left) and average zero-shot accuracy (right) of 2-bit quantized Llama 2 models as a function of model size (GiB). See the experiments section for the detailed setup.
  • Figure 2: (left) L2 errors for the 17th layer of Llama 2 7B with different representations. Full-model perplexity on WikiText-2 is reported without fine-tuning (middle) and with fine-tuning (right).
  • Figure 3: PV algorithm applied to a low-dimensional ($d=6$) quadratic objective. The starting point $x^0$ is chosen randomly by the random point generation algorithm.
  • Figure 4: Optimized PV algorithm applied to the quadratic objective, $d=100$. Number of runs with different random initial points: $r=50$.
  • Figure 5: Experiments with the linearized PV algorithm.
  • ...and 2 more figures

Theorems & Definitions (8)

  • Theorem 3.1: Convergence of the PV method
  • Lemma 3.2
  • Lemma 3.3: Monotonicity
  • Example N.1
  • Example O.1
  • Example O.3
  • Theorem P.1
  • proof