MASS: MoErging through Adaptive Subspace Selection

Donato Crisostomi, Alessandro Zirilli, Antonio Andrea Gargiulo, Maria Sofia Bucarelli, Simone Scardapane, Fabrizio Silvestri, Iacopo Masi, Emanuele Rodolà

TL;DR

MASS presents a training-free MoErging approach that fuses multiple fine-tuned endpoints by embedding their per-task updates into low-rank subspaces and routing inputs to the most relevant subspaces. It combines a fixed, data-free merging step with an adaptive, input-driven router that selects task subspaces via projection residuals, enabling a second-pass inference atop a shared backbone. Across ViT-CLIP backbones and 8–20 tasks, MASS achieves state-of-the-art results, recovering a large portion of the individual fine-tuned accuracies with only modest inference overhead and storage growth. The method also provides interpretable task vectors and robust performance in batched settings, highlighting its practicality for scalable, data-free model merging in real-world deployments.

Abstract

Model merging has recently emerged as a lightweight alternative to ensembling, combining multiple fine-tuned models into a single set of parameters with no additional training overhead. Yet existing merging methods fall short of matching the full accuracy of separately fine-tuned endpoints. We present MASS (MoErging through Adaptive Subspace Selection), a new approach that closes this gap by unifying multiple fine-tuned models while retaining near state-of-the-art performance across tasks. Building on the low-rank decomposition of per-task updates, MASS stores only the most salient singular components for each task and merges them into a shared model. At inference time, a non-parametric, data-free router identifies which subspace (or combination thereof) best explains an input's intermediate features and activates the corresponding task-specific block. This procedure is fully training-free and introduces only a two-pass inference overhead plus a storage factor of ~2 compared to a single pretrained model, irrespective of the number of tasks. We evaluate MASS on CLIP-based image classification using ViT-B-16, ViT-B-32 and ViT-L-14 on benchmarks of 8, 14 and 20 tasks, establishing a new state of the art. Most notably, MASS recovers up to ~98% of the average accuracy of individual fine-tuned models, making it a practical alternative to ensembling at a fraction of the storage cost.
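
The low-rank step mentioned in the abstract can be illustrated with a minimal NumPy sketch. It only shows how the leading singular components of a single per-task update might be extracted and stored; the function names, the shapes, and the choice of `k` are hypothetical, and the rule MASS uses to merge the components across tasks is not reproduced here.

```python
import numpy as np

def top_k_components(w_pre, w_ft, k):
    """Return the k leading singular triplets of the task update w_ft - w_pre."""
    delta = w_ft - w_pre  # per-task update for one weight matrix
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]

def low_rank_update(u, s, vt):
    """Rebuild the rank-k approximation of the update from its stored factors."""
    return (u * s) @ vt

# Synthetic example (assumption): a 64x32 layer whose fine-tuning update has rank 4.
rng = np.random.default_rng(0)
w_pre = rng.normal(size=(64, 32))
true_update = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 32))
w_ft = w_pre + true_update

u, s, vt = top_k_components(w_pre, w_ft, k=4)
approx = low_rank_update(u, s, vt)
rel_err = np.linalg.norm(true_update - approx) / np.linalg.norm(true_update)
```

Because the synthetic update is exactly rank 4, keeping four components reconstructs it up to floating-point error; for real fine-tuned weights the truncation is lossy and `k` trades storage against fidelity.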

Paper Structure

This paper contains 37 sections, 1 theorem, 16 equations, 12 figures, 5 tables, 2 algorithms.

Key Result

Proposition 10.1

Let $V \in \mathbb{R}^{d \times k}$ have orthonormal columns spanning a subspace $\mathcal{S} \subseteq \mathbb{R}^d$, and let $\mathbf{a} \in \mathbb{R}^d$. Then the unique minimizer of $\|\mathbf{a} - \mathbf{w}\|_2^2$ over all $\mathbf{w} \in \mathcal{S}$ is $\mathbf{w}^\star = V V^\top \mathbf{a}$, the orthogonal projection of $\mathbf{a}$ onto $\mathcal{S}$.
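
A standard argument for this optimality claim runs as follows (a sketch, not necessarily the paper's own proof). Every $\mathbf{w} \in \mathcal{S}$ can be written as $\mathbf{w} = V\mathbf{c}$ for some $\mathbf{c} \in \mathbb{R}^k$. Since $V^\top V = I$, the residual $\mathbf{a} - VV^\top\mathbf{a}$ is orthogonal to every vector in $\mathcal{S}$, so the cross term vanishes and

$$
\begin{aligned}
\|\mathbf{a} - V\mathbf{c}\|_2^2
&= \|(\mathbf{a} - VV^\top\mathbf{a}) + V(V^\top\mathbf{a} - \mathbf{c})\|_2^2 \\
&= \|\mathbf{a} - VV^\top\mathbf{a}\|_2^2 + \|V^\top\mathbf{a} - \mathbf{c}\|_2^2 .
\end{aligned}
$$

The first term does not depend on $\mathbf{c}$, and the second is zero if and only if $\mathbf{c} = V^\top\mathbf{a}$, which gives the unique minimizer $\mathbf{w}^\star = VV^\top\mathbf{a}$.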

Figures (12)

  • Figure 1: (left) Fine-tuning holds three separate models on different tasks A, B and C. (middle) Model merging produces a single model incorporating task vectors $\{\text{A},\text{B},\text{C}\}$ using a constant function of the input. (right) MASS stores the pretrained model $\theta_\text{pre}$ and the orthogonalized task singular vectors $V_{\perp}^\top$ across tasks. At test time, MASS adaptively performs merging using a routing mechanism that chooses appropriate task vectors for the input $\mathbf{x}$, using a thresholded gating function $g(\mathbf{x})$. The gate is the residual between the activations of $\mathbf{x}$ and their projections onto the span of the right singular vectors $V_{\perp}$.
  • Figure 2: Projection of the activations $\mathbf{z}_{\ell}$ onto the span of TSVs $\mathbf{v}_1, \mathbf{v}_2$.
  • Figure 3: Per-layer task accuracies for ViT-B-32 on the 20-task benchmark. Layers starting with 'A' indicate attention layers, while those starting with 'M' refer to MLPs.
  • Figure 4: Normalized task accuracies for ViT-B-32, ViT-B-16 and ViT-L-14 on the 20-task benchmark.
  • Figure 5: Captions obtained by decoding task singular vectors as text (see the paper's subsection on decoding text), accompanied by task-representative images. Captions produced by the task singular vectors of predictive layers reflect the task content, whereas those from non-predictive layers do not.
  • ...and 7 more figures
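
The routing idea in the Figure 1 and Figure 2 captions (gate on the residual between an activation and its projection onto each task's singular-vector span) can be sketched roughly as below. This is an illustration under stated assumptions, not the paper's implementation: the per-task bases are random orthonormal matrices rather than orthogonalized task singular vectors, and the relative-residual threshold `tau` with an argmin fallback is a hypothetical gating rule.

```python
import numpy as np

def projection_residuals(z, bases):
    """Relative residual ||z - V V^T z|| / ||z|| for each basis V (d x k, orthonormal columns)."""
    z_norm = np.linalg.norm(z)
    return np.array([np.linalg.norm(z - v @ (v.T @ z)) / z_norm for v in bases])

def route(z, bases, tau=0.5):
    """Select tasks whose relative residual is at most tau (hypothetical threshold).

    Falls back to the single lowest-residual task if no task passes the threshold.
    """
    res = projection_residuals(z, bases)
    selected = np.flatnonzero(res <= tau)
    if selected.size == 0:
        selected = np.array([int(np.argmin(res))])
    return selected, res

rng = np.random.default_rng(0)
d, k, n_tasks = 32, 4, 3
# Stand-in task subspaces (assumption): orthonormal bases obtained via reduced QR.
bases = [np.linalg.qr(rng.normal(size=(d, k)))[0] for _ in range(n_tasks)]
# An activation lying mostly in task 1's subspace, plus a little noise.
z = bases[1] @ rng.normal(size=k) + 0.01 * rng.normal(size=d)

selected, res = route(z, bases)
```

In this toy setup the activation is explained almost entirely by the second basis, so that task has the smallest residual and is the one the gate picks; in the two-pass scheme described in the abstract, the selected task-specific components would then be activated for the second forward pass.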

Theorems & Definitions (2)

  • Proposition 10.1: Optimality of Orthogonal Projection
  • proof