Label-Free Cross-Task LoRA Merging with Null-Space Compression

Wonyoung Lee, Wooseong Jeong, Kuk-Jin Yoon

Abstract

Model merging combines independently fine-tuned checkpoints without joint multi-task training. In the foundation-model era, fine-tuning with Low-Rank Adaptation (LoRA) is prevalent, making LoRA merging a promising target. Existing approaches can work in homogeneous settings where all target tasks are classification, but they often fail when tasks span classification and regression. Approaches using entropy-based surrogates do not apply to regression and are costly for large language models due to long token sequences. We introduce Null-Space Compression (NSC) Merging, a label-free, output-agnostic method that sets merge weights from adapter geometry. Our key observation is that during LoRA fine-tuning the down-projection factor $\bm A$ in $\Delta \bm W = \bm B \bm A$ compresses its null space, and the degree of compression correlates with task performance. NSC uses this compression as an optimization signal for merging that generalizes across classification, regression, and sequence generation. NSC achieves state-of-the-art performance across twenty heterogeneous vision tasks, with balanced gains where prior methods overfit to subsets of tasks. It also outperforms baselines on six NLI benchmarks and on vision-language evaluations for VQA and image captioning, demonstrating scalability and effectiveness.
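
As a rough, hedged illustration of the idea in the abstract, the sketch below derives a per-layer merge weight for each task purely from the geometry of its LoRA down-projection factor $\bm A_k$. The `null_space_ratio` definition (fraction of singular values below a tolerance) and the softmax weighting rule are assumptions made for illustration; the paper's exact ratio definition and weighting scheme are not reproduced here.

```python
# Hedged sketch: label-free merge weights from LoRA adapter geometry.
import numpy as np

def null_space_ratio(A: np.ndarray, tol: float = 1e-3) -> float:
    """Fraction of near-collapsed singular directions of the down-projection A.

    Stand-in definition (assumption): singular values below tol * sigma_max
    are treated as belonging to the (numerical) null space.
    """
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.mean(s < tol * s.max()))

def nsc_style_merge(adapters):
    """Merge per-task LoRA updates Delta_W_k = B_k @ A_k for one layer.

    `adapters` is a list of (B_k, A_k) pairs. The mapping from ratio to
    weight (a softmax over the ratios) is an illustrative assumption,
    not the paper's exact formula.
    """
    scores = np.array([null_space_ratio(A) for _, A in adapters])
    weights = np.exp(scores) / np.exp(scores).sum()   # normalize across tasks
    return sum(w * (B @ A) for w, (B, A) in zip(weights, adapters))
```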

Paper Structure

This paper contains 31 sections, 1 theorem, 19 equations, 4 figures, 14 tables, 1 algorithm.

Key Result

Proposition 1

For any LoRA update $\Delta \bm W_k = \bm B_k \bm A_k$ and any $\bm{z} \in \mathbb{R}^d$, the following inequality holds:

$$\|\Delta \bm W_k\, \bm z\|_2 \;\geq\; C_k\, \|\bm z_{\parallel}\|_2,$$

where $\bm z_{\parallel}$ denotes the component of $\bm z$ orthogonal to the null space of $\bm A_k$ and $C_k = \sigma_{\min}(\bm{B}_k)\,\sigma_{\min}(\bm{A}_k)$.
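
As a minimal numerical sanity check of this bound under the reading above (the adapter's effect is bounded below on the component of $\bm z$ outside the null space of $\bm A_k$), the snippet below verifies the inequality with random matrices; the shapes are illustrative assumptions.

```python
# Numerical sanity check of the adapter-effect lower bound.
# Shapes and the projection-based reading are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
d_out, d, r = 64, 48, 8                      # hypothetical LoRA shapes
B = rng.standard_normal((d_out, r))          # up-projection factor B_k
A = rng.standard_normal((r, d))              # down-projection factor A_k
z = rng.standard_normal(d)

U, s_A, Vt = np.linalg.svd(A, full_matrices=False)
s_B = np.linalg.svd(B, compute_uv=False)
C_k = s_B.min() * s_A.min()                  # C_k = sigma_min(B_k) * sigma_min(A_k)

z_par = Vt.T @ (Vt @ z)                      # component of z in the row space of A_k
lhs = np.linalg.norm(B @ A @ z)
rhs = C_k * np.linalg.norm(z_par)
assert lhs >= rhs - 1e-9, (lhs, rhs)
print(f"||B A z|| = {lhs:.4f} >= C_k * ||z_par|| = {rhs:.4f}")
```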

Figures (4)

  • Figure 1: Visualization of validation loss, task performance, and the null-space ratio for an output projection layer during LoRA fine-tuning on (a) image classification and (b) depth estimation.
  • Figure 2: Classification accuracy versus the null-space ratio. Accuracy is averaged within quartiles of the ratio across tasks. Left: fine-tuned experts; right: the merged model.
  • Figure 3: Extended visualization of validation loss, task performance, and the null-space ratio during LoRA fine-tuning on (a) image classification and (b) depth estimation. We additionally show the null-space ratio trajectories of LoRA at the query, key, and value projections of the self-attention module within a single transformer block, further illustrating the null-space compression phenomenon.
  • Figure 4: Null-space ratio of each transformer block during LoRA fine-tuning for (a) image classification and (b) depth estimation. Values are averaged across transformer blocks, showing that null-space compression consistently occurs throughout the model.

Theorems & Definitions (2)

  • Proposition 1: Adapter effect lower bound
  • Proof of Proposition 1