LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules

Ivan Vulić, Adam Grycner, Quentin de Laroussilhe, Jonas Pfeiffer

TL;DR

LoRA-Squeeze addresses the challenge of selecting and deploying fixed-rank LoRA adapters by proposing an overparameterized training paradigm: fine-tune with a high source rank $r_{src}$ to capture rich task updates $\Delta W$, then compress to a lower deployment rank $r_{tgt}$ using Randomized SVD. The method yields two operational modes—Post-Squeeze (post-hoc compression) and In-Squeeze (in-tuning rank annealing)—and includes a memory-efficient variant that avoids materializing the full $\Delta W$. Across 13 text tasks and 10 vision-language tasks on the Gemma-3 family, Post-Squeeze often matches or exceeds directly trained $r_{tgt}$-rank LoRA adapters, while In-Squeeze provides the best size–performance trade-offs, with Cont-Squeeze enabling quick recovery after aggressive compression. By decoupling training and deployment ranks and reducing per-rank hyperparameter sweeps, LoRA-Squeeze simplifies deployment and enhances parameter efficiency in practical PEFT settings, with potential extensions to other LoRA variants and automatic rank-selection strategies.

Abstract

Despite its huge number of variants, standard Low-Rank Adaptation (LoRA) is still a dominant technique for parameter-efficient fine-tuning (PEFT). Nonetheless, it faces persistent challenges, including the pre-selection of an optimal rank and rank-specific hyper-parameters, as well as the deployment complexity of heterogeneous-rank modules and more sophisticated LoRA derivatives. In this work, we introduce LoRA-Squeeze, a simple and efficient methodology that aims to improve standard LoRA learning by changing LoRA module ranks either post-hoc or dynamically during training. Our approach posits that it is better to first learn an expressive, higher-rank solution and then compress it, rather than learning a constrained, low-rank solution directly. The method involves fine-tuning with a deliberately high(er) source rank, reconstructing the full weight update matrix (or efficiently approximating that reconstruction), and then using Randomized Singular Value Decomposition (RSVD) to create a new, compressed LoRA module at a lower target rank. Extensive experiments across 13 text and 10 vision-language tasks show that post-hoc compression often produces lower-rank adapters that outperform those trained directly at the target rank, especially if a small number of fine-tuning steps at the target rank is allowed. Moreover, a gradual, in-tuning rank annealing variant of LoRA-Squeeze consistently achieves the best LoRA size-performance trade-off.
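The post-hoc compression step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `post_squeeze`, the convention $\Delta W = BA$ with $B \in \mathbb{R}^{d_{out} \times r_{src}}$ and $A \in \mathbb{R}^{r_{src} \times d_{in}}$, the oversampling and power-iteration settings, and the even split of singular values between the two new factors are all assumptions made for illustration.

```python
import numpy as np


def post_squeeze(B, A, r_tgt, oversample=8, n_iter=2, seed=0):
    """Compress a LoRA pair (B, A) of rank r_src to rank r_tgt (hypothetical helper).

    B: (d_out, r_src), A: (r_src, d_in); the weight update is delta_w = B @ A.
    """
    rng = np.random.default_rng(seed)
    delta_w = B @ A  # materialize the full update, shape (d_out, d_in)

    # Randomized range finder: sketch the column space of delta_w.
    k = min(r_tgt + oversample, min(delta_w.shape))
    omega = rng.standard_normal((delta_w.shape[1], k))
    Y = delta_w @ omega
    for _ in range(n_iter):  # power iterations, re-orthonormalized for stability
        Q, _ = np.linalg.qr(Y)
        Z, _ = np.linalg.qr(delta_w.T @ Q)
        Y = delta_w @ Z
    Q, _ = np.linalg.qr(Y)  # orthonormal basis, shape (d_out, k)

    # Exact SVD of the small projected matrix, then lift back to d_out.
    U_small, s, Vt = np.linalg.svd(Q.T @ delta_w, full_matrices=False)
    U = Q @ U_small

    # Assumption: split the singular values evenly between the new factors.
    root = np.sqrt(s[:r_tgt])
    B_tgt = U[:, :r_tgt] * root         # (d_out, r_tgt)
    A_tgt = root[:, None] * Vt[:r_tgt]  # (r_tgt, d_in)
    return B_tgt, A_tgt
```

The returned pair can be loaded like any LoRA module of rank $r_{tgt}$, and a few further fine-tuning steps at the target rank can follow, as the abstract suggests.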

Paper Structure

This paper contains 17 sections, 8 equations, 10 figures, 8 tables, and 2 algorithms.

Figures (10)

  • Figure 1: LoRA-Squeeze after fine-tuning (Post-Squeeze). We fine-tune a LoRA module at a higher 'source' rank $r_{src}$ and then transform it to a lower 'target' rank $r_{tgt}$.
  • Figure 2: LoRA-Squeeze during fine-tuning (In-Squeeze). We can gradually anneal the LoRA rank during fine-tuning by reconstructing the full delta $\Delta W$ from the current LoRA, decomposing it into a lower-rank LoRA via Randomized SVD, and continuing fine-tuning at the lower rank. This repeats the main Post-Squeeze steps (Figure 1) multiple times during fine-tuning, following a predetermined annealing scheme.
  • Figure 3: Performance on 3 representative text-based tasks when the learning-rate hyperparameter search for LoRAs is run only at the highest rank in the figures ($r_{src}=128$) and the same learning rate is kept for direct fine-tuning at all other (lower) ranks. A simple offline Post-Squeeze method can bypass the hyperparameter search and yield better-performing LoRAs without any fine-tuning at the lower ranks. Similar patterns are observed for the VL tasks; see the selection of plots in Figure \ref{fig:subopts_vl} in Appendix \ref{app:additional}. Remark: For the higher results with $r_{tgt}$-rank LoRAs, where a learning rate sweep for $r_{tgt}$ was performed, we refer the reader to Table \ref{tab:finetuning_strategies_4b_text}.
  • Figure 4: Performance difference heatmaps on text tasks for (a) Gemma 3 4B IT and (b) Gemma 3 1B IT. Each heatmap plots the average performance gain of Post-Squeeze from a given source rank $r_{src}$ (y-axis) to a target rank $r_{tgt}$ (x-axis), relative to a baseline LoRA module trained directly at $r_{tgt}$. Red cells indicate a positive gain, signifying that Post-Squeeze outperforms direct fine-tuning.
  • Figure 5: Performance difference between the memory-efficient LoRA-Squeeze and the standard Post-Squeeze variant, averaged over the 13 text-only tasks.
  • ...and 5 more figures
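
The In-Squeeze loop in the Figure 2 caption and the memory-efficient idea in the Figure 5 caption can be sketched together as follows. This is a hedged illustration, not the paper's algorithm: `squeeze_factored`, `in_squeeze`, the `schedule` dictionary, and the `train_step` callback are hypothetical names, and the factored randomized SVD shown here is only one possible way to avoid forming the full $\Delta W = BA$ (the source does not spell out the authors' variant).

```python
import numpy as np


def squeeze_factored(B, A, r_tgt, oversample=8, seed=0):
    """Reduce (B, A) to rank r_tgt without forming the (d_out, d_in) product B @ A."""
    rng = np.random.default_rng(seed)
    r_src = B.shape[1]
    k = min(r_tgt + oversample, r_src)
    omega = rng.standard_normal((A.shape[1], k))
    Y = B @ (A @ omega)        # (d_out, k): sketch computed factor by factor
    Q, _ = np.linalg.qr(Y)     # orthonormal basis for the sketched range
    M = (Q.T @ B) @ A          # (k, d_in): small projected matrix
    U_small, s, Vt = np.linalg.svd(M, full_matrices=False)
    root = np.sqrt(s[:r_tgt])  # assumption: split singular values evenly
    B_new = (Q @ U_small[:, :r_tgt]) * root
    A_new = root[:, None] * Vt[:r_tgt]
    return B_new, A_new


def in_squeeze(B, A, total_steps, schedule, train_step):
    """Anneal the LoRA rank during training.

    schedule:   dict mapping a step index to the rank to squeeze to at that step
                (a hypothetical stand-in for the predetermined annealing scheme).
    train_step: callable (B, A) -> (B, A) performing one optimizer update.
    """
    for step in range(total_steps):
        if step in schedule:
            B, A = squeeze_factored(B, A, schedule[step])
        B, A = train_step(B, A)
    return B, A
```

For example, `in_squeeze(B, A, total_steps=1000, schedule={400: 32, 700: 8}, train_step=my_update)` would train at the source rank for 400 steps, at rank 32 for the next 300, and at rank 8 for the remainder; the step counts, ranks, and `my_update` are placeholders, not settings from the paper. When `r_tgt + oversample` is smaller than the current rank, the sketch is approximate, and power iterations could be added to tighten it.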