ACE-Merging: Data-Free Model Merging with Adaptive Covariance Estimation

Bo Xu, Haotian Wu, Hehai Lin, Weiquan Huang, Beier Zhu, Yao Shu, Chengwei Qin

TL;DR

This paper provides a theoretical analysis showing that the input covariance of each task, a key factor for optimal merging, can be implicitly estimated from the parameter differences of its fine-tuned model, even in a fully data-free setting.

Abstract

Model merging aims to combine multiple task-specific expert models into a single model while preserving generalization across diverse tasks. However, interference among experts, especially when they are trained on different objectives, often leads to significant performance degradation. Despite recent progress, resolving this interference without data access, retraining, or architectural modification remains a fundamental challenge. This paper provides a theoretical analysis demonstrating that the input covariance of each task, which is a key factor for optimal merging, can be implicitly estimated from the parameter differences of its fine-tuned model, even in a fully data-free setting. Building on this insight, we introduce ACE-Merging, an Adaptive Covariance Estimation framework that effectively mitigates inter-task interference. Our approach features a principled, closed-form solution that contrasts with prior iterative or heuristic methods. Extensive experiments on both vision and language benchmarks demonstrate that ACE-Merging sets a new state-of-the-art among data-free methods. It consistently outperforms existing baselines; for example, ACE-Merging achieves an average absolute improvement of 4% over previous methods across seven tasks on GPT-2. Owing to its efficient closed-form formulation, ACE-Merging delivers superior performance with a modest computational cost, providing a practical and theoretically grounded solution for model merging.
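
For intuition about why the input covariance matters for merging, consider a single linear layer under the standard layer-wise least-squares view. This is an illustrative assumption, not necessarily the paper's exact objective. The symbols $W_t$ (weights of expert $t$), $\mathcal{D}_t$ (its input distribution), and $C_t$ (its uncentered input covariance) are introduced here for illustration, and $\sum_t C_t$ is assumed invertible:

```latex
\begin{aligned}
\bar{W}
  &= \arg\min_{W} \sum_{t=1}^{T} \mathbb{E}_{x \sim \mathcal{D}_t}
     \left\| (W - W_t)\, x \right\|_2^2 \\
  &= \arg\min_{W} \sum_{t=1}^{T}
     \operatorname{tr}\!\left( (W - W_t)\, C_t\, (W - W_t)^{\top} \right),
  \qquad C_t = \mathbb{E}_{x \sim \mathcal{D}_t}\!\left[ x x^{\top} \right], \\[4pt]
0 &= \sum_{t=1}^{T} (\bar{W} - W_t)\, C_t
  \;\;\Longrightarrow\;\;
  \bar{W} = \Big( \sum_{t=1}^{T} W_t\, C_t \Big)
            \Big( \sum_{t=1}^{T} C_t \Big)^{-1}.
\end{aligned}
```

In a data-free setting the $C_t$ cannot be computed from inputs, so a method in this family would substitute estimates $\hat{C}_t$ derived from the parameter differences $\Delta W_t$. The specific estimator is not stated in this abstract.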

Paper Structure (59 sections, 59 equations, 7 figures, 9 tables, 1 algorithm)

Figures (7)

  • Figure 1: Comparison of inter-task heterogeneity ($\gamma$) across architectures. ViT-B/16 shows relatively uniform scaling ($\gamma < 0.25$), while RoBERTa-Base exhibits much stronger variance ($\gamma > 0.3$) across layers. This cross-architecture contrast motivates our adaptive scaling mechanism.
  • Figure 2: Comparison between the preliminary closed-form solution $\bar{W}_{\mathrm{pre}}$ and the final merged model $\bar{W}$. Left: The singular value spectrum of $\bar{W}_{\mathrm{pre}}$ is extremely concentrated: the top 5% of singular values capture more than 99% of the total energy, indicating severe spectral ill-conditioning. In contrast, $\bar{W}$ exhibits a substantially flatter spectrum. Right: Despite this spectral imbalance, the leading eigenvectors of the two matrices are nearly identical (cosine similarity $\approx 1$), showing that $\bar{W}_{\mathrm{pre}}$ already identifies the correct structural subspace. Spectral Refinement preserves this subspace while restoring a stable and expressive energy distribution.
  • Figure 3: Empirical distributions of $\Delta W_{t}$ across architectures, layer types, and tasks. Each subplot shows the histogram of the entries of $\Delta W_{t}$ for one representative layer from RoBERTa-Base, GPT-2, and ViT-B/16. All distributions exhibit near-zero means and closely resemble Gaussian profiles, providing strong empirical support for our theoretical assumptions (A3) and (A4).
  • Figure 4: RoBERTa-Large (8 tasks). Trace statistics $\{\tau_t\}$ and heterogeneity coefficient $\gamma$.
  • Figure 5: GPT-2 (7 tasks). Larger heterogeneity $\gamma$ compared to RoBERTa, indicating stronger mismatch among task covariances.
  • ...and 2 more figures
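
The energy-concentration statistic quoted in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes "energy" means the sum of squared singular values, and the function name `top_energy_fraction` and the toy spectra are hypothetical.

```python
import math


def top_energy_fraction(singular_values, top_fraction=0.05):
    """Share of spectral energy held by the largest singular values.

    Assumption: energy is the sum of squared singular values.
    """
    if not singular_values:
        raise ValueError("need at least one singular value")
    # Square each value and sort so the largest contributions come first.
    squared = sorted((s * s for s in singular_values), reverse=True)
    # Number of values in the top slice; always keep at least one.
    k = max(1, math.ceil(top_fraction * len(squared)))
    total = sum(squared)
    if total == 0:
        return 0.0
    return sum(squared[:k]) / total


# Toy spectra (made up for illustration): one dominant value versus a flat spectrum.
concentrated = [100.0] + [0.1] * 19
flat = [1.0] * 20

print(top_energy_fraction(concentrated))  # close to 1: ill-conditioned spectrum
print(top_energy_fraction(flat))          # small: energy spread evenly
```

A value near 1 for a small `top_fraction` corresponds to the concentrated spectrum the caption describes for $\bar{W}_{\mathrm{pre}}$, while a flatter spectrum yields a value closer to `top_fraction` itself.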