Multi-Level Collaboration in Model Merging

Qi Li, Runpeng Yu, Xinchao Wang

TL;DR

This work investigates whether parameter-level model merging can match the performance of prediction-level ensembling in multi-model collaborations beyond the usual setting of two ViT-based models fine-tuned from a shared pre-trained checkpoint. It develops a theoretical result showing that the gap between merging and ensembling is second-order small under a weighted-offset constraint, and introduces NeuLig, a validation framework in which a lightweight module called Portland generates CoopVecs and a joint loss aligns merging with ensembling. Empirical results across ViT and ResNet backbones, multiple datasets, and up to seven collaborating models show that, with NeuLig, merging performs nearly identically to or better than ensembling, with very small or zero gaps in many configurations. The findings point to a scalable pathway for multi-model collaboration that is not tied to a single architecture family.

Abstract

Parameter-level model merging is an emerging paradigm in multi-task learning with significant promise. Previous research has explored its connections with prediction-level model ensembling, commonly viewed as the upper bound for merging, to reveal the potential of achieving performance consistency between the two. However, this observation relies on certain preconditions: it is limited to two models, uses ViT-based models, and requires all models to be fine-tuned from the same pre-trained checkpoint. To further understand the intrinsic connections between model merging and model ensembling, this paper explores an interesting possibility: if these restrictions are removed, can performance consistency still be achieved between merging and ensembling? To answer this question, we first theoretically establish a performance correlation between merging and ensembling. We find that even when previous restrictions are not met, there is still a way for model merging to attain performance near-identical to, or even better than, that of ensembling. To verify whether our findings are practical, we introduce a validation framework termed Neural Ligand (NeuLig). The learning process of NeuLig is designed around a specialized loss function supported by theoretical foundations. Experimental results demonstrate the robustness of NeuLig with respect to both model scale and the number of collaborating models. For instance, in the case involving 5 CLIP-ViT-B/32 models, parameter-level merging achieves nearly the same performance as prediction-level ensembling (merging: 95.44% vs. ensembling: 95.46%).
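
The distinction between the two levels of collaboration can be illustrated numerically. The sketch below is a toy example and not the paper's experiment: it assumes a two-parameter tanh model, three hypothetical parameter offsets from a shared base point, and fixed weights that sum to one. The names `model`, `merge`, `ensemble`, and `gap`, and all numbers, are made up for illustration. It compares the output of the merged parameters with the weighted average of the individual outputs while the offsets are scaled down.

```python
import math

def model(theta, x):
    """Toy nonlinear model: f_theta(x) = tanh(theta[0] * x + theta[1])."""
    return math.tanh(theta[0] * x + theta[1])

def merge(thetas, weights):
    """Parameter-level merging: weighted average of the parameter vectors."""
    dim = len(thetas[0])
    return [sum(w * th[i] for w, th in zip(weights, thetas)) for i in range(dim)]

def ensemble(thetas, weights, x):
    """Prediction-level ensembling: weighted average of the model outputs."""
    return sum(w * model(th, x) for w, th in zip(weights, thetas))

def gap(scale, x=0.7):
    """Absolute merging-vs-ensembling gap when all offsets are multiplied by `scale`."""
    base = [0.5, -0.2]                                # hypothetical shared base point
    offsets = [[1.0, 0.3], [-0.6, 0.8], [0.2, -1.0]]  # hypothetical per-model offsets
    weights = [0.5, 0.3, 0.2]                         # assumed weights, summing to one
    thetas = [[b + scale * o for b, o in zip(base, off)] for off in offsets]
    merged_out = model(merge(thetas, weights), x)
    return abs(merged_out - ensemble(thetas, weights, x))

# Halving the offsets should shrink the gap by roughly a factor of four.
for scale in (0.1, 0.05, 0.025):
    print(f"offset scale {scale:<6} gap {gap(scale):.3e}")
```

Because the weights sum to one, the first-order terms of a Taylor expansion around the merged parameters cancel, so the gap in this toy setting is driven by second-order terms and shrinks roughly quadratically with the offset scale.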

Paper Structure

This paper contains 18 sections, 1 theorem, 9 equations, 9 figures, 9 tables.

Key Result

Proposition 1

For $T$ neural networks parameterized by $\boldsymbol{\theta}_t$ (where $t = 1, 2, \dots, T$ and $\boldsymbol{\theta}_t \in \Theta$ for all $t$), assuming $f_{\boldsymbol{\theta}_t}(\cdot)$ is continuous and, for all $(x, y) \in \mathcal{D}$, $f_{\boldsymbol{\theta}}(x, y)$ is (at least) twice differentiable ...

Figures (9)

  • Figure 1: An illustration of Portland, which consists of a linear layer followed by a softmax function.
  • Figure 2: The training process of Portland. The CoopVec is combined separately with the model output and the modified offsets, contributing to two respective terms in the loss function.
  • Figure 3: A toy experiment to verify theoretical feasibility. In this experiment, we merge two models that were fine-tuned on different datasets. Marker shapes represent different methods, while colors indicate different experimental groups, with each group using a distinct combination of datasets. In total, 10 groups are run (represented by 10 different colors). Hollow markers indicate each method's average result across these 10 groups.
  • Figure 4: CoopVec Distribution of different tasks and the corresponding CoopVec Map after training for one epoch.
  • Figure 5: The variation of the diagonal values of CoopVec Map throughout the training process using CLIP-RN50 (top) and CLIP-ViT-B/32 (bottom).
  • ...and 4 more figures
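
The captions of Figures 1 and 2 describe Portland as a linear layer followed by a softmax, whose output (the CoopVec) is combined separately with the model outputs and with the offsets. A minimal sketch of that data flow follows. The input features, layer sizes, random initialization, helper `weighted_sum`, and all numeric values are assumptions of this sketch and are not taken from the paper. The joint loss is not reproduced because its form is not given here.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class Portland:
    """Linear layer followed by softmax, following the Figure 1 caption.

    Input features and dimensions are assumptions of this sketch.
    """

    def __init__(self, in_dim, num_models, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(num_models)]
        self.bias = [0.0] * num_models

    def __call__(self, features):
        scores = [sum(w * f for w, f in zip(row, features)) + b
                  for row, b in zip(self.weight, self.bias)]
        return softmax(scores)  # CoopVec: one positive weight per model, summing to one

def weighted_sum(coop_vec, vectors):
    """Combine per-model vectors (outputs or parameter offsets) using the CoopVec."""
    dim = len(vectors[0])
    return [sum(c * v[i] for c, v in zip(coop_vec, vectors)) for i in range(dim)]

portland = Portland(in_dim=4, num_models=3)
coop_vec = portland([0.2, -1.0, 0.5, 0.3])  # hypothetical input features

model_logits = [[2.0, 0.1], [1.5, 0.4], [0.3, 1.9]]               # made-up outputs of 3 models
offsets = [[0.1, -0.2, 0.0], [0.0, 0.3, -0.1], [-0.2, 0.1, 0.2]]  # made-up parameter offsets

ensemble_logits = weighted_sum(coop_vec, model_logits)  # prediction-level combination
merged_offset = weighted_sum(coop_vec, offsets)         # parameter-level combination
print(coop_vec, ensemble_logits, merged_offset)
```

Using the same CoopVec for both combinations is what lets a single set of weights be scored at the prediction level and at the parameter level, which matches the two loss terms mentioned in the Figure 2 caption.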

Theorems & Definitions (1)

  • Proposition 1