Globally optimized SVD compression of LLMs via Fermi-function-based rank selection and gauge fixing

Roman Rausch, David Jansen, Sukhbinder Singh, Román Orús

TL;DR

The paper targets memory-efficient deployment of large language models by improving data-aware SVD compression. It introduces FermiGrad, a gradient-based method that globally optimizes per-layer SVD ranks by soft-truncating singular values with a Fermi function, and PivGa, a lossless secondary compression leveraging gauge freedom via Interpolative Decomposition. Together, these methods achieve better accuracy at fixed model size than uniform rank reductions, with practical trade-offs between speed and compression. The techniques offer a principled, physics-inspired route to high-quality, compact LLMs suitable for edge and resource-constrained settings.

Abstract

Large Language Models (LLMs) are very demanding in terms of computational resources. Low-rank decomposition of LLM weights, e.g. via Singular Value Decomposition (SVD), is a promising approach to LLM compression, but it presents several practical hurdles, e.g. selecting appropriate layer-wise ranks and removing the parameter redundancy of the low-rank factors. In this work, we present two physics-inspired improvements to SVD LLM compression: (1) FermiGrad, a gradient-descent algorithm that determines globally optimal layer-wise ranks by relaxing the discrete singular-value truncation into a continuous optimization using the Fermi function; (2) PivGa, an additional lossless compression of the low-rank factors that exploits the intrinsic gauge freedom in their parametrization.
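To make the idea of a relaxed truncation concrete, here is a minimal NumPy sketch of soft-truncating the singular values of a single matrix with a Fermi function. It is an illustration under stated assumptions, not the authors' implementation: the mask is assumed to act on the singular-value index, and the names `fermi_mask`, `soft_truncate`, `mu` and `temperature` are hypothetical. In the low-temperature limit the mask approaches the usual hard cut at a fixed rank, while at finite temperature it varies smoothly with `mu`, which is what would make gradient-based rank selection possible.

```python
import numpy as np

def fermi_mask(n, mu, temperature):
    """Fermi occupation f_i = 1 / (exp((i - mu) / T) + 1) for indices i = 0..n-1."""
    idx = np.arange(n)
    # Clip the exponent so that very low temperatures do not overflow.
    x = np.clip((idx - mu) / temperature, -50.0, 50.0)
    return 1.0 / (np.exp(x) + 1.0)

def soft_truncate(U, s, Vt, mu, temperature):
    """Rebuild the matrix with singular values damped by the Fermi mask."""
    f = fermi_mask(len(s), mu, temperature)
    return (U * (s * f)) @ Vt

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))          # random stand-in for one weight matrix
U, s, Vt = np.linalg.svd(W, full_matrices=False)

rank = 16
mu = rank - 0.5                            # slider sits between index 15 and 16

W_hard = (U[:, :rank] * s[:rank]) @ Vt[:rank]            # discrete truncation
W_cold = soft_truncate(U, s, Vt, mu, temperature=1e-3)   # nearly a step function
W_warm = soft_truncate(U, s, Vt, mu, temperature=2.0)    # smooth cutoff

soft_rank = fermi_mask(len(s), mu, 2.0).sum()            # smooth function of mu
print("cold limit matches hard truncation:", np.allclose(W_cold, W_hard))
print("soft rank at T=2:", soft_rank)
```

The sum of the mask serves here as a continuous proxy for the rank. A global optimizer could, in principle, penalize the total of such soft ranks across layers, although the exact objective used by FermiGrad is not specified in this summary.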

Paper Structure

This paper contains 4 sections and 3 figures.

Figures (3)

  • Figure 1: a) Illustration of FermiGrad for a single layer: $F$ is the active Fermi tensor that softens the SVD truncation while $A$ and $B$ are frozen, and $\mu$ provides a slider to select the singular values. b) Optimization space for two 1024$\times$1024 layers (2M parameters) and 1.57M target parameters. The grey area shows the box constraint for the ranks: $1\leq r_l\leq1024$, $l=0,1$. The red crosses denote the hypothesized optimal solution. The blue (light blue) lines show the parameter constraints for regular (secondary) compression. The FermiGrad algorithm starts with full rank and moves along some trajectory to the optimal solution as the penalty term is increased (see text).
  • Figure 2: Inference speed of the PivGa approach compared to PiFa [Zhao2025_Pifa] and the pure SVD model (without PiFa/PivGa), measured with random tokens (sequence length 256, batch size 32) on an H200 GPU for Llama-3.1-8B-Instruct in bfloat16.
  • Figure 3: Benchmark of the FermiGrad results (solid lines) for different datasets, compared with uniform compression (dotted lines). The dashed line with square markers indicates PivGa compression for a selected dataset. Parameters: model = Llama-3.1-8B-Instruct; calibration dataset size = 65536 (max length = 1024); FermiGrad dataset size = 1024 (max length = 512).
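
The parameter constraints in the Figure 1b caption can be reproduced with a short calculation. The sketch below rests on assumptions not stated in this summary: a rank-r factorization of an m×n matrix is taken to cost r(m+n) parameters, and gauge fixing is taken to remove the r² redundant parameters, leaving r(m+n−r). The helper names are hypothetical. The second half illustrates the gauge freedom itself on small random factors, using the first r rows as a simplified stand-in for the pivot selection of an interpolative decomposition.

```python
import numpy as np

def svd_params(r, m, n):
    """Parameters of a rank-r factorization A (m x r) times B (r x n)."""
    return r * (m + n)

def gauge_fixed_params(r, m, n):
    """Assumed count after removing the r*r redundant gauge parameters."""
    return r * (m + n - r)

m = n = 1024
target = 1.57e6                      # target from the Figure 1b caption
dense = 2 * m * n                    # two uncompressed layers

def max_uniform_rank(count):
    """Largest rank r (same for both layers) that stays within the target."""
    return max(r for r in range(1, m + 1) if 2 * count(r, m, n) <= target)

r_regular = max_uniform_rank(svd_params)
r_secondary = max_uniform_rank(gauge_fixed_params)
print(dense, r_regular, r_secondary)

# Gauge freedom: A @ B == (A @ inv(G)) @ (G @ B) for any invertible G.
rng = np.random.default_rng(0)
mm, nn, r = 12, 10, 4
A = rng.standard_normal((mm, r))
B = rng.standard_normal((r, nn))

G = A[:r, :]                            # simplification: first r rows as "pivots"
A_fixed = np.linalg.solve(G.T, A.T).T   # equals A @ inv(G)
B_fixed = G @ B

# The top r x r block of A_fixed is the identity and need not be stored.
stored = A_fixed[r:].size + B_fixed.size
print(np.allclose(A_fixed @ B_fixed, A @ B), stored == gauge_fixed_params(r, mm, nn))
```

Under these assumptions the same parameter budget admits a noticeably higher uniform rank once the gauge redundancy is removed, and the product of the factors is unchanged up to floating-point error. A practical implementation would choose well-conditioned pivot rows rather than the first r rows.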