Model soups need only one ingredient

Alireza Abdollahpoorrostam; Nikolaos Dimitriadis; Adam Hazimeh; Pascal Frossard

TL;DR

MonoSoup tackles the common fine-tuning problem where specialization boosts in-distribution accuracy at the expense of out-of-distribution robustness. It introduces a data-free, post-hoc editing technique that decomposes a single fine-tuned model's layer updates via singular value decomposition, separating high-energy task directions from low-energy residuals and reweighting them with adaptive per-layer coefficients derived from spectral decay and alignment signals. By using entropy-based effective rank to determine the cut between subspaces and anisotropic, layer-wise mixing, MonoSoup matches or surpasses multi-checkpoint Model Soup baselines while avoiding their training and storage costs. The approach yields consistent gains on vision (CLIP/ImageNet) and language (Qwen) benchmarks and proves complementary to Wise-FT, offering a practical and scalable tool for robust deployment of large foundation models.

Abstract

Fine-tuning large pre-trained models on a target distribution often improves in-distribution (ID) accuracy, but at the cost of out-of-distribution (OOD) robustness as representations specialize to the fine-tuning data. Weight-space ensembling methods, such as Model Soups, mitigate this effect by averaging multiple checkpoints, but they are computationally prohibitive, requiring the training and storage of dozens of fine-tuned models. In this paper, we introduce MonoSoup, a simple, data-free, hyperparameter-free, post-hoc method that achieves a strong ID-OOD balance using only a single checkpoint. Our method applies Singular Value Decomposition (SVD) to each layer's update and decomposes it into high-energy directions that capture task-specific adaptation and low-energy directions that introduce noise but may still encode residual signals useful for robustness. MonoSoup then uses entropy-based effective rank to automatically reweight these components with layer-wise coefficients that account for the spectral and geometric structure of the model. Experiments on CLIP models fine-tuned on ImageNet and evaluated under natural distribution shifts, as well as on Qwen language models tested on mathematical reasoning and multiple-choice benchmarks, show that this plug-and-play approach is a practical and effective alternative to multi-checkpoint methods, retaining much of their benefit without their computational overhead.
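The per-layer procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `effective_rank` and `split_and_reweight` are hypothetical, the cut index is taken as the ceiling of the effective rank (one possible rounding rule), and the coefficients `lam_high` and `lam_low` are plain placeholders standing in for the adaptive layer-wise coefficients, whose formula is not given in this summary.

```python
import numpy as np


def effective_rank(singular_values: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank: exp of the Shannon entropy of the
    singular values normalized to sum to one."""
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > 0]  # drop zeros so log() is defined
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


def split_and_reweight(w_pre, w_ft, lam_high=1.0, lam_low=0.5):
    """Split one layer's update into high- and low-energy parts and remix them.

    lam_high and lam_low are hypothetical fixed weights used only for
    illustration; they are NOT the coefficients used by MonoSoup.
    """
    delta = w_ft - w_pre  # the layer's fine-tuning update
    u, s, vt = np.linalg.svd(delta, full_matrices=False)

    # Assumption: cut the spectrum at the ceiling of the effective rank.
    k = int(np.ceil(effective_rank(s)))
    k = max(1, min(k, len(s)))

    high = (u[:, :k] * s[:k]) @ vt[:k]  # top-k (high-energy) directions
    low = (u[:, k:] * s[k:]) @ vt[k:]   # remaining (low-energy) directions
    return w_pre + lam_high * high + lam_low * low, k


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_pre = rng.normal(size=(6, 4))
    w_ft = w_pre + 0.1 * rng.normal(size=(6, 4))
    w_new, k = split_and_reweight(w_pre, w_ft)
    print("cut index:", k, "edited weight shape:", w_new.shape)
```

With both weights set to 1 the sketch returns the fine-tuned layer unchanged, and with both set to 0 it returns the pre-trained layer, so the two coefficients interpolate between the endpoints separately in each subspace.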

Paper Structure

This paper contains 27 sections, 2 theorems, 24 equations, 13 figures, 5 tables.

Introduction
Preliminaries
The role of alignment in model merging
MonoSoup
Experiments
Merging Vision Transformers
Merging Large Language Models
Integration with Wise-FT
Analysis and Discussion
Related Work
Conclusion
Comparison with single-model merging Methods
Zero-shot initialization (ZS init)
Connection between $R$ and $\cos\alpha$
Low-Energy Directions
...and 12 more sections

Key Result

Lemma 1

If $k > 1$, then

Figures (13)

Figure 1: Performance and alignment analysis of Model Stock on 2,409 pairwise combinations of CLIP ViT-B/32 models fine-tuned on ImageNet. The panels show a scatter plot of ID vs. OOD performance relative to the better constituent model, and the layer-wise cosine similarity for low-performing and high-performing combinations, respectively. Stronger alignment coincides with consistent gains, highlighting that alignment can serve as a key predictor of merging success.
Figure 2: Performance of Similarity-Filtered Greedy Soup (SFGS). Evaluated on CLIP ViT-B/32 checkpoints, SFGS achieves competitive ID and OOD performance relative to validation-based greedy soup. This supports the finding that geometric alignment is a key indicator of merging effectiveness.
Figure 3: Effect of truncating low-energy components on different benchmarks. On the 20-task vision benchmark, performance saturates after retaining only a small number of singular values, consistent with prior reports that low-rank updates suffice. On ImageNet with natural OOD shifts, truncation substantially reduces both ID and OOD accuracy, even when preserving 95% of spectral energy. This highlights that, in large-scale fine-tuning, low-energy directions carry critical information for generalization and cannot simply be removed. See the appendix section "Low-Energy Directions" for further details.
Figure 4: MonoSoup integrated with Wise-FT on CLIP ViT-B/32. MonoSoup improves ID and OOD accuracy across individual checkpoints. When combined with Wise-FT, the Pareto fronts consistently dominate those of Wise-FT and LiNeS, showing that MonoSoup provides a stronger endpoint for interpolation-based robustness.
Figure 5: Component Analysis. Effect of varying the variance threshold $R$ and the contributions of each term in the coefficient $\lambda^\ell=\lambda^\ell_{\text{low}}$ on CLIP ViT-L/14. Results are stable across a wide range of $R$ values, and both the spectral decay and cosine overlap components contribute meaningfully to the final balance between ID and OOD performance.
...and 8 more figures

Theorems & Definitions (4)

Lemma 1
proof
Theorem 1
proof
