ReALLM: A general framework for LLM compression and fine-tuning

Louis Leconte, Lisa Bedin, Van Minh Nguyen, Eric Moulines

TL;DR

ReALLM tackles the memory bottleneck of large language models by representing each pre-trained weight matrix as a high-precision low-rank component plus a vector-quantized latent learned by an autoencoder. Only the low-rank component and the scales are fine-tuned, which enables efficient training on modest hardware while preserving a compact, decoder-based reconstruction of each weight matrix. By adapting the autoencoder shape to per-matrix patterns and using block-wise and end-to-end fine-tuning, ReALLM reports the best results on language generation tasks at a 3-bit budget without training, and state-of-the-art performance at 2 bits after fine-tuning on a small calibration dataset. The approach unifies post-training quantization and fine-tuning under a single framework, offering practical pathways for deploying and updating LLMs on memory-constrained devices.

Abstract

We introduce ReALLM, a novel approach for compression and memory-efficient adaptation of pre-trained language models that encompasses most of the post-training quantization and fine-tuning methods for a budget of $<4$ bits. Pre-trained matrices are decomposed into a high-precision low-rank component and a vector-quantized latent representation (using an autoencoder). During the fine-tuning step, only the low-rank components are updated. Our results show that pre-trained matrices exhibit different patterns, so ReALLM adapts the shape of the encoder (small/large embedding, high/low bit VQ, etc.) to each matrix. ReALLM represents each matrix with a small embedding on $b$ bits and a neural decoder model $\mathcal{D}_\phi$ with its weights on $b_\phi$ bits. Decompressing a matrix requires only one embedding and a single forward pass through the decoder. Our weight-only quantization algorithm yields the best results on language generation tasks (C4 and WikiText-2) for a budget of $3$ bits without any training. With a budget of $2$ bits, ReALLM achieves state-of-the-art performance after fine-tuning on a small calibration dataset.
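The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: a truncated SVD stands in for the high-precision low-rank component, and a plain k-means codebook stands in for the vector-quantized latent and neural decoder $\mathcal{D}_\phi$. All function names, the rank, the vector dimension, and the bit budget are assumptions chosen for illustration, and the bit accounting ignores the codebook's own storage cost.

```python
import numpy as np

def low_rank_part(W, rank):
    """Best rank-`rank` approximation of W via truncated SVD (kept in high precision)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def vq_fit(vectors, n_codes, n_iter=20, seed=0):
    """Plain k-means codebook; an assumed stand-in for the learned VQ latent + decoder."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), n_codes, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every vector to every code, then nearest-code assignment.
        dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = dist.argmin(1)
        for k in range(n_codes):
            members = vectors[idx == k]
            if len(members):  # leave empty clusters unchanged
                codebook[k] = members.mean(0)
    dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook, dist.argmin(1)

def compress(W, rank=4, dim=2, bits_per_weight=2):
    """Split W into a low-rank part and a vector-quantized remainder."""
    L = low_rank_part(W, rank)
    remainder = (W - L).reshape(-1, dim)       # groups of `dim` consecutive weights
    n_codes = 2 ** (dim * bits_per_weight)     # index cost: bits_per_weight bits per weight
    codebook, idx = vq_fit(remainder, n_codes)
    return L, codebook, idx

def decompress(L, codebook, idx, shape):
    """Reconstruct W from the low-rank part and a codebook lookup."""
    return L + codebook[idx].reshape(shape)

def rel_frobenius_error(W, W_hat):
    """Relative reconstruction error in Frobenius norm."""
    return np.linalg.norm(W - W_hat) / np.linalg.norm(W)

if __name__ == "__main__":
    W = np.random.default_rng(0).standard_normal((64, 64))
    L, codebook, idx = compress(W)
    W_hat = decompress(L, codebook, idx, W.shape)
    print("low-rank only:", rel_frobenius_error(W, L))
    print("low-rank + VQ:", rel_frobenius_error(W, W_hat))
```

In this sketch, fine-tuning would correspond to updating only `L` (stored as its two factors in practice) while `codebook` and `idx` stay frozen, which mirrors the abstract's statement that only the low-rank components are updated.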

Paper Structure (22 sections, 1 equation, 4 figures, 7 tables, 2 algorithms)

Figures (4)

  • Figure 1: Pre-trained matrix from the first block (left; with "structures"), and pre-trained matrix from the last block (right) for two different models. Stronger vertical patterns appear in the first blocks.
  • Figure 2: ReALLM. During the fine-tuning step, only the low-rank components and the scales are updated.
  • Figure 3: Reconstruction error (Frobenius norm) for layers of type "Q" across all blocks. QuIP# (Tseng et al., 2024) does not take advantage of the structures in the first blocks.
  • Figure 4: Reconstruction error (Frobenius norm) for layers of type "Q" across all blocks of the Gemma2b LLM.