Merging Models on the Fly Without Retraining: A Sequential Approach to Scalable Continual Model Merging

Anke Tang; Enneng Yang; Li Shen; Yong Luo; Han Hu; Bo Du; Dacheng Tao

TL;DR

The paper tackles sequential model merging without retraining by introducing Orthogonal Projection-based Continual Merging (OPCM), a training-free approach that merges models one at a time using orthogonal projections of weight updates and adaptive time-varying scaling. It proves theoretical properties such as orthogonality of projected updates and bounded drift from the pre-trained base, and demonstrates empirical gains (5–8% average accuracy improvement) on CLIP-ViT models across 8–20 tasks with robustness to task order. The method maintains constant memory $O(|\theta|)$ and mitigates interference between tasks by projecting new task vectors onto subspaces orthogonal to the current merged update, while preserving knowledge through cumulative projections. The results indicate that OPCM scales favorably with model capacity and task count, offering a practical solution for continual multi-task learning in large foundation models and suggesting extensions to language and multimodal domains in future work.

Abstract

Deep model merging is an emerging research direction that combines multiple fine-tuned models to harness their specialized capabilities across different tasks and domains. Current model merging techniques focus on merging all available models simultaneously, with weight interpolation-based methods being the predominant approach. However, these conventional approaches are not well suited to scenarios where models become available sequentially, and they often suffer from high memory requirements and potential interference between tasks. In this study, we propose a training-free, projection-based continual merging method that processes models sequentially through orthogonal projections of weight matrices and adaptive scaling mechanisms. Our method projects new parameter updates onto subspaces orthogonal to the existing merged parameter updates and uses an adaptive scaling mechanism to maintain stable parameter distances, enabling efficient sequential integration of task-specific knowledge. Our approach maintains constant memory complexity with respect to the number of models, minimizes interference between tasks through orthogonal projections, and retains the performance of previously merged models through adaptive task vector scaling. Extensive experiments on CLIP-ViT models demonstrate that our method achieves a 5–8% average accuracy improvement while maintaining robust performance across different task orderings.
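
The sequential procedure described in the abstract can be illustrated with a minimal NumPy sketch for a single weight matrix. This is not the paper's exact algorithm: the projection here is a simplified stand-in that removes only the component of a new task vector along the current merged update (under the Frobenius inner product), and the scaling rule, which rescales the merged update to the running mean of the task-vector norms, is an assumption chosen to illustrate the "stable parameter distance" idea. The names `project_out` and `ContinualMerger` are hypothetical.

```python
import numpy as np


def project_out(delta, merged, eps=1e-12):
    """Remove from `delta` its component along `merged` (Frobenius inner product).

    Simplified stand-in for the paper's subspace projection (assumption).
    """
    denom = float(np.sum(merged * merged))
    if denom < eps:  # nothing merged yet: keep the task vector unchanged
        return delta.copy()
    coeff = float(np.sum(delta * merged)) / denom
    return delta - coeff * merged


class ContinualMerger:
    """Merges fine-tuned weights one at a time, storing only O(|theta|) state."""

    def __init__(self, base):
        self.base = base                    # pre-trained weights
        self.total = np.zeros_like(base)    # unscaled sum of projected task vectors
        self.count = 0                      # number of models merged so far
        self.mean_norm = 0.0                # running mean of task-vector norms
        self.last_projection = None         # kept only for inspection

    def add(self, finetuned):
        delta = finetuned - self.base       # task vector of the new model
        self.count += 1
        self.mean_norm += (np.linalg.norm(delta) - self.mean_norm) / self.count
        self.last_projection = project_out(delta, self.total)
        self.total = self.total + self.last_projection
        return self.weights()

    def weights(self):
        # Hypothetical scaling rule: keep the distance to the base model
        # equal to the running mean of the task-vector norms.
        norm = np.linalg.norm(self.total)
        scale = self.mean_norm / norm if norm > 0 else 1.0
        return self.base + scale * self.total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(4, 4))
    merger = ContinualMerger(base)
    for _ in range(3):
        merged = merger.add(base + 0.1 * rng.normal(size=(4, 4)))
    print("distance to base:", np.linalg.norm(merged - base))
```

Each fine-tuned model can be discarded after its call to `add`, so the stored state does not grow with the number of merged models; only the base weights, one accumulated update, and two scalars are kept.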

Paper Structure (21 sections, 4 theorems, 30 equations, 15 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Deep Model Fusion
Continual Learning
Rethinking Model Merging From a Continual Learning Perspective
Problem Setup
Opportunities & Challenges in Continual Merging
Methodology
Theoretical Analysis
Experiments
Experimental Setup
Continual Multi-Task Model Merging
Hyper-Parameter Analysis
Conclusion and Future Work
Proofs
...and 6 more sections

Key Result

Theorem 5.1

Given a sequence of task-specific models $\{f_{\theta^{(t)}}\}_{t=1}^T$ fine-tuned from a pre-trained model $f_{\theta^{(0)}}$, for any time step $t$, the projected task vector $\mathcal{P}^{(t-1)}_{\alpha}(\Delta W^{(t)})$ obtained from the linear weight update rule is orthog...

Figures (15)

Figure 1: Comparison between conventional and continual model merging approaches. (a) Conventional model merging requires simultaneous access to all expert models $\{\theta^{(i)}\}_{i=1}^T$, performing merging in a single step. (b) The continual model merging processes models sequentially as they become available.
Figure 2: Memory complexity of task arithmetic (Ilharco et al., 2023) and our method.
Figure 3: Subspace view. Geometric interpretation of the proposed continual model merging approach, illustrating the orthogonal projection and adaptive scaling mechanisms.
Figure 4: The cosine similarity between task vectors of ViT-B/32.
Figure 5: Performance comparison of ViT models with different architectures across an increasing number of sequential tasks.
...and 10 more figures

Theorems & Definitions (9)

Definition 3.1: Conventional Model Merging
Definition 3.2: Continual Model Merging
Theorem 5.1: Orthogonality of Projected Task Vectors
Theorem 5.2: Bounded Parameter Distance
Corollary 5.3: Preservation of Task Information
Theorem 1.1: General Term Formula
Proof
Proof of Theorem 5.1 (Orthogonality of Projected Task Vectors)
Proof of Theorem 5.2 (Bounded Parameter Distance)
