Merging without Forgetting: Continual Fusion of Task-Specific Models via Optimal Transport

Zecheng Pan, Zhikang Chen, Ding Li, Min Zhang, Sen Cui, Hongshuo Jin, Luqi Tao, Yi Yang, Deheng Ye, Yu Zhang, Tingting Zhu, Tianling Ren

TL;DR

This work tackles the problem of merging fine-tuned task-specific models without accessing historical data or retraining. It introduces Optimal Transport-based Masked Fusion (OTMF), which uses learnable masks guided by Sinkhorn-distance distribution alignment to fuse task vectors while preserving the semantic geometry of each task. The method enables continual fusion with constant memory by reusing the merged model and only updating lightweight masks and, optionally, a classification head, yielding strong accuracy and robustness across vision and language benchmarks. Empirically, OTMF achieves state-of-the-art performance in both accuracy and efficiency, demonstrating practical value for scalable, replay-free multi-task deployment in privacy- or resource-constrained settings.
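The Sinkhorn distance mentioned above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The function name `sinkhorn_distance`, the squared-Euclidean cost, the uniform marginals, and the values of `eps` and `n_iters` are all assumptions chosen for illustration. The code computes an entropy-regularised transport plan between two small sets of feature vectors and returns the resulting transport cost.

```python
import math


def sinkhorn_distance(xs, ys, eps=0.1, n_iters=200):
    """Entropy-regularised OT cost between two point clouds (illustrative sketch).

    Assumptions (not taken from the paper): squared Euclidean ground cost,
    uniform weights on both clouds, plain (non log-domain) Sinkhorn updates.
    Large costs relative to eps can underflow; use a log-domain variant then.
    """
    n, m = len(xs), len(ys)
    # Pairwise squared Euclidean cost matrix C[i][j].
    cost = [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in ys] for x in xs]
    # Gibbs kernel K = exp(-C / eps).
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a = [1.0 / n] * n  # uniform source marginal
    b = [1.0 / m] * m  # uniform target marginal
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P = diag(u) K diag(v).
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    distance = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return distance, plan


# Toy features: the same cloud, and a copy shifted by one unit along the second axis.
feats_pre = [[0.0, 0.0], [1.0, 0.0]]
feats_shifted = [[0.0, 1.0], [1.0, 1.0]]
d_same, _ = sinkhorn_distance(feats_pre, feats_pre)
d_shift, plan_shift = sinkhorn_distance(feats_pre, feats_shifted)
print(d_same, d_shift)
```

In a training loop, a differentiable version of this quantity would serve as the loss that drives the mask updates. The sketch only shows how the distance grows when one feature distribution drifts away from the other.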

Abstract

Merging models fine-tuned for different tasks into a single unified model has become an increasingly important direction for building versatile, efficient multi-task systems. Existing approaches predominantly rely on parameter interpolation in weight space, which we show introduces significant distribution shift in the feature space and undermines task-specific knowledge. In this paper, we propose OTMF (Optimal Transport-based Masked Fusion), a novel model merging framework rooted in optimal transport theory to address the distribution shift that arises from naive parameter interpolation. Instead of directly aggregating features or weights, OTMF aligns the semantic geometry of task-specific models by discovering common masks applied to task vectors through optimal transport plans. These masks selectively extract transferable and task-agnostic components while preserving the unique structural identities of each task. To ensure scalability in real-world settings, OTMF further supports a continual fusion paradigm that incrementally integrates each new task vector without revisiting previous ones, maintaining a bounded memory footprint and enabling efficient fusion across a growing number of tasks. We conduct comprehensive experiments on multiple vision and language benchmarks, and results show that OTMF achieves state-of-the-art performance in terms of both accuracy and efficiency. These findings highlight the practical and theoretical value of our approach to model merging.
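The continual, mask-based fusion described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the helper name `fuse_step`, the sigmoid parameterisation of the masks, the element-wise masking, and the additive combination of the two masked task vectors are all hypothetical choices. The masks are passed in as fixed logits here, whereas in OTMF they would be learned by minimising an optimal transport loss.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse_step(pretrained, merged_prev, finetuned_new, logits_pre, logits_post):
    """One continual fusion step on flat parameter lists (illustrative sketch).

    Task vectors are differences from the frozen pretrained weights. Each is
    scaled element-wise by a soft mask in (0, 1) and added back to the
    pretrained weights. The additive rule is an assumption for illustration.
    """
    fused = []
    for w0, w_pre, w_post, lp, lq in zip(
        pretrained, merged_prev, finetuned_new, logits_pre, logits_post
    ):
        tau_pre = w_pre - w0    # task vector of the previous merged model
        tau_post = w_post - w0  # task vector of the incoming fine-tuned model
        fused.append(w0 + sigmoid(lp) * tau_pre + sigmoid(lq) * tau_post)
    return fused


# Toy example: two parameters, masks fully open (large positive logits).
pretrained = [1.0, 2.0]
merged_prev = [1.5, 2.0]    # previous merge changed only the first parameter
finetuned_new = [1.0, 3.0]  # new task changed only the second parameter
open_mask = [50.0, 50.0]
closed_mask = [-50.0, -50.0]

fused_open = fuse_step(pretrained, merged_prev, finetuned_new, open_mask, open_mask)
fused_keep = fuse_step(pretrained, merged_prev, finetuned_new, open_mask, closed_mask)

# Continual use: the fused result becomes the "pre" model for the next task,
# so only the pretrained weights, the current merge, and one incoming model
# need to be held at any time.
merged = pretrained
for incoming in (merged_prev, finetuned_new):
    merged = fuse_step(pretrained, merged, incoming, open_mask, open_mask)
print(fused_open, fused_keep, merged)
```

Because each step consumes only the current merged model and one new model, the number of stored parameter sets does not grow with the number of tasks, which is the bounded-memory property the abstract refers to.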

Paper Structure

This paper contains 35 sections, 12 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: Left: OTMF captures the information shared between the pre and post weights while reducing distribution shift. Middle: t-SNE visualizations show that OTMF yields output distributions closely aligned with those of the pre model, outperforming Task-wise AdaMerging. Right: OTMF outperforms other sequential methods in average accuracy while using less CUDA memory than Task-wise AdaMerging, highlighting its advantages in both performance and efficiency.
  • Figure 2: Left: Overview of the OTMF continual merging pipeline. Given task vectors from the previous merged model (pre) and the next task's SFT model (post), learnable masks modulate their contributions. The masked vectors are fused and combined with a frozen pretrained model to form the new merged model, which serves as the pre model for the next step, enabling continual task accumulation. Right: The OT loss aligns the merged model's features with those of the pre and post models via a cost matrix over feature distributions. This guides mask updates to ensure distributional consistency and knowledge retention.
  • Figure 3: The Distribution Visualization of Continual OT Mask Merging
  • Figure 4: The Distribution Visualization of Continual Task-wise AdaMerging
  • Figure 5: Ablation Study of OTMF
  • ...and 5 more figures