Model Merging in the Essential Subspace

Longhua Li, Lei Qi, Qi Tian, Xin Geng

TL;DR

ESM (Essential Subspace Merging) is a robust model-merging framework that mitigates inter-task interference while preserving core task-specific functionality. It also introduces a multi-level polarized scaling strategy that amplifies parameters containing critical knowledge and suppresses redundant ones, so that essential knowledge is not overwhelmed during fusion.

Abstract

Model merging aims to integrate multiple task-specific fine-tuned models derived from a shared pre-trained checkpoint into a single multi-task model without additional training. Despite extensive research, task interference remains a major obstacle that often undermines the performance of merged models. In this paper, we propose ESM (Essential Subspace Merging), a robust framework for effective model merging. We begin by performing Principal Component Analysis (PCA) on the feature shifts induced by parameter updates. The resulting principal directions span an essential subspace that dominates the effect of the update on feature representations. Before merging, each task's parameter update matrix is projected onto its own essential subspace, which yields a low-rank decomposition. This mitigates inter-task interference while preserving core task-specific functionality. Furthermore, we introduce a multi-level polarized scaling strategy that amplifies parameters containing critical knowledge and suppresses redundant ones, preventing essential knowledge from being overwhelmed during fusion. Extensive experiments across multiple task sets and model scales demonstrate that our method achieves state-of-the-art performance in multi-task model merging.
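
The projection step described in the abstract can be sketched in NumPy. This is an illustrative reading, not the authors' implementation: the helper name `essential_subspace_decompose`, the calibration inputs, and the use of an uncentered second-moment matrix for the PCA are all assumptions made here, since this page does not give those details.

```python
import numpy as np


def essential_subspace_decompose(delta_w, inputs, k):
    """Hypothetical helper: rank-k factors of delta_w from PCA on feature shifts.

    delta_w: (d_out, d_in) task matrix (fine-tuned minus pre-trained weights).
    inputs:  (n, d_in) sample inputs to the layer (an assumed calibration set).
    Returns (basis, coeffs, eigvals) where basis @ coeffs approximates delta_w.
    """
    # Feature shift caused by the update for each sample: z = delta_w @ x.
    shifts = inputs @ delta_w.T                          # (n, d_out)
    # Assumption: PCA on the uncentered second moment of the shifts.
    second_moment = shifts.T @ shifts / shifts.shape[0]  # (d_out, d_out)
    eigvals, eigvecs = np.linalg.eigh(second_moment)     # ascending order
    order = np.argsort(eigvals)[::-1]                    # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    basis = eigvecs[:, :k]        # top-k principal directions, (d_out, k)
    coeffs = basis.T @ delta_w    # task matrix expressed in that subspace
    return basis, coeffs, eigvals


rng = np.random.default_rng(0)
d_out, d_in, n, k = 6, 8, 200, 3
delta_w = rng.normal(size=(d_out, d_in))
# Anisotropic inputs, so some input directions matter more than others.
inputs = rng.normal(size=(n, d_in)) * np.linspace(3.0, 0.1, d_in)

basis, coeffs, eigvals = essential_subspace_decompose(delta_w, inputs, k)
delta_hat = basis @ coeffs  # rank-k approximation of the task matrix

# Mean squared feature-shift error introduced by the truncation.
shift_err = np.mean(np.sum((inputs @ (delta_w - delta_hat).T) ** 2, axis=1))
print(shift_err, eigvals[k:].sum())
```

Under these assumptions the two printed numbers agree up to floating-point error: the feature-shift error of the truncated update equals the sum of the discarded eigenvalues, which is why the decomposition is driven by the inputs the layer actually sees rather than by the weights alone.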

Paper Structure (43 sections, 3 theorems, 25 equations, 13 figures, 8 tables, 1 algorithm)

This paper contains 43 sections, 3 theorems, 25 equations, 13 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

Given a task matrix $\Delta_W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ with its singular value decomposition $\Delta_W = U\Sigma V^\top = \sum_{i=1}^r \sigma_i u_i v_i^\top$. Let $\widehat{\Delta_W} = \sum_{i=1}^k \sigma_i u_i v_i^\top$ be its top-$k$ rank approximation. For an input $x

Figures (13)

  • Figure 1: Mean accuracy of the merged ViT-B/32 models on 8-, 14-, and 20-task benchmarks. The proposed ESM effectively reduces the performance gap to the individually fine-tuned models.
  • Figure 2: Essential Subspace Decomposition (ESD) versus Singular Value Decomposition (SVD). Unlike SVD, which decomposes the task matrix based solely on its weights, ESD decomposes it based on the distribution of feature shifts. When components are truncated for merging, ESD's expected truncation error is directly related to the magnitude of the discarded eigenvalues, and ESD retains more task knowledge.
  • Figure 3: Performance evaluation of layer-wise task matrix loading under different ordering strategies on a pre-trained ViT-B/32 backbone. (a) Direct loading of task matrices. (b) Loading with layer-wise norms averaged to reflect directional importance.
  • Figure 4: Performance evaluation on the 8-task benchmark when loading merged task matrices layer-by-layer into a pre-trained backbone under different ordering strategies.
  • Figure 5: (a) The proposed Polarized Scaling uses the norm of parameter updates as scaling factors to amplify essential parameters submerged by redundant ones. (b) This scaling is applied at three distinct levels: across tasks, dimensions, and layers.
  • ...and 8 more figures
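
The eigenvalue statement in the Figure 2 caption can be made concrete under one assumption that this page does not confirm: that the essential subspace is obtained from the uncentered second moment of the feature shift $z = \Delta_W x$. The symbols $C$, $\lambda_i$, $p_i$, and $P_k$ are introduced here for illustration only. Writing the eigendecomposition as

$$
C = \mathbb{E}\left[z z^\top\right] = \sum_{i=1}^{d_{\text{out}}} \lambda_i\, p_i p_i^\top, \qquad \lambda_1 \ge \dots \ge \lambda_{d_{\text{out}}} \ge 0,
$$

and keeping the top-$k$ directions $P_k = [p_1, \dots, p_k]$, the expected squared error of the projected feature shift is

$$
\mathbb{E}\left\| z - P_k P_k^\top z \right\|_2^2 = \operatorname{tr}\!\left( (I - P_k P_k^\top)\, C \right) = \sum_{i=k+1}^{d_{\text{out}}} \lambda_i .
$$

Under this reading, the error depends only on the discarded eigenvalues, so truncating the smallest ones gives the smallest expected feature-shift error among rank-$k$ orthogonal projections on the output side.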

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof