FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization

Hao Mark Chen, Shell Xu Hu, Wayne Luk, Timothy Hospedales, Hongxiang Fan

TL;DR

This work addresses the challenge of scaling model merging to large pools of open-source, partly unknown fine-tuned checkpoints while maintaining robustness to irrelevant models. It reframes merging as constrained optimization over the convex hull of checkpoints and leverages a Frank-Wolfe–style iterative procedure that selects the most relevant model via a linear minimization oracle and merges it with a stable, feasible update. The proposed FW-Merging framework includes design options such as Hard vs Soft FW and Task-wise vs Layer-wise LMO, achieving strong empirical gains on both language and vision tasks and showing constant memory overhead even as the model pool grows. The results demonstrate that FW-Merging can outperform data-free and data-informed baselines as well as traditional MTL approaches, offering a scalable, data-efficient alternative for merging diverse open-source models with practical impact for multi-task deployment. The work also provides open-source code to facilitate adoption in real-world settings.
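
For orientation, the textbook Frank-Wolfe iteration over the convex hull of a finite checkpoint pool can be written as below. The notation is an assumption for illustration and is not taken from the paper: $\theta^{(1)},\dots,\theta^{(n)}$ are the checkpoints, $f$ is the merging objective, $\theta_t$ is the current merged model, and $\gamma_t$ is the step size. The Hard vs Soft and Task-wise vs Layer-wise variants mentioned above are not captured by this sketch.

$$
s_t = \arg\min_{s \in \{\theta^{(1)},\dots,\theta^{(n)}\}} \langle s, \nabla f(\theta_t) \rangle,
\qquad
\theta_{t+1} = (1-\gamma_t)\,\theta_t + \gamma_t\, s_t,
\qquad
\gamma_t \in [0,1].
$$

A linear function over a convex hull attains its minimum at a vertex, so the linear minimization oracle only has to scan the checkpoints themselves. Because each update is a convex combination of two points in the hull, every iterate stays feasible.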

Abstract

Model merging has emerged as a promising approach for multi-task learning (MTL), offering a data-efficient alternative to conventional fine-tuning. However, with the rapid development of the open-source AI ecosystem and the increasing availability of fine-tuned foundation models, existing model merging methods face two key limitations: (i) they are primarily designed for in-house fine-tuned models, making them less adaptable to diverse model sources with partially unknown model and task information; and (ii) they struggle to scale effectively when merging numerous model checkpoints. To address these challenges, we formulate model merging as a constrained optimization problem and introduce a novel approach: Frank-Wolfe Merging (FW-Merging). Inspired by Frank-Wolfe optimization, our approach iteratively selects the most relevant model in the pool to minimize a linear approximation of the objective function and then executes a local merging step similar to the Frank-Wolfe update. The objective function is designed to capture the desired behavior of the target merged model, while the fine-tuned candidate models define the constraint set. More importantly, FW-Merging is orthogonal to existing merging methods and integrates with them to further improve accuracy. Our experiments show that FW-Merging scales across diverse model sources, remaining stable with 16 irrelevant models and improving by 15.3% with 16 relevant models on 20 CV tasks, while maintaining constant memory overhead, unlike the linear overhead of data-informed merging methods. Compared with state-of-the-art approaches, FW-Merging surpasses the data-free merging method by 32.8% and outperforms the data-informed AdaMerging by 8.39% when merging 20 ViT models. Our code is open-sourced at github.com/hmarkc/FW-Merging.
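
The select-then-merge loop described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: it is the textbook Frank-Wolfe method applied to toy data. The function name `frank_wolfe_merge`, the three-dimensional vectors standing in for flattened model parameters, the quadratic placeholder objective, and the step size 2/(t+2) are all assumptions made for illustration.

```python
import numpy as np


def frank_wolfe_merge(checkpoints, grad_fn, num_iters=200, init_index=0):
    """Textbook Frank-Wolfe over the convex hull of `checkpoints` (one per row).

    Hypothetical helper for illustration only. Returns the merged vector and
    the convex-combination weights assigned to each checkpoint.
    """
    checkpoints = np.asarray(checkpoints, dtype=float)
    weights = np.zeros(checkpoints.shape[0])
    weights[init_index] = 1.0  # start at a vertex so the iterate is feasible
    theta = checkpoints[init_index].copy()
    for t in range(num_iters):
        grad = grad_fn(theta)
        # Linear minimization oracle: the minimum of a linear function over a
        # convex hull is attained at a vertex, so scanning checkpoints suffices.
        k = int(np.argmin(checkpoints @ grad))
        gamma = 2.0 / (t + 2.0)  # classic diminishing step size (assumed here)
        # Convex-combination update keeps theta inside the hull.
        theta = (1.0 - gamma) * theta + gamma * checkpoints[k]
        weights *= 1.0 - gamma
        weights[k] += gamma
    return theta, weights


# Toy pool: two "relevant" checkpoints and one "irrelevant" checkpoint.
pool = np.array([
    [1.0, 0.0, 0.0],  # relevant
    [0.0, 1.0, 0.0],  # relevant
    [0.0, 0.0, 5.0],  # irrelevant: far from the target along an unused axis
])
target = np.array([0.5, 0.5, 0.0])  # placeholder for the desired behavior


def loss(theta):
    """Placeholder quadratic objective; the paper's objective is not shown here."""
    return 0.5 * float(np.sum((theta - target) ** 2))


def grad(theta):
    """Gradient of the placeholder objective."""
    return theta - target


merged, w = frank_wolfe_merge(pool, grad)
print("merged:", merged)
print("weights:", w)
print("loss:", loss(merged))
```

In this toy setting the oracle never selects the irrelevant checkpoint, because its inner product with the gradient is never the smallest, so its weight stays at zero. Only the current iterate and the selected checkpoint are needed for each update, which is the intuition behind memory cost that does not grow with the pool size.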

Paper Structure

This paper contains 37 sections, 3 theorems, 22 equations, 5 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

The optimization problems in equations eq:fw_orig and eq:fw_conv are equivalent.

Figures (5)

  • Figure 1: Performance scaling of FW-Merging across CV tasks. (a) demonstrates robustness to irrelevant models, while (b) shows improved performance with relevant models. (c) analyzes performance degradation when incorporating a noisy model initialized from a different pre-trained checkpoint. Detailed results and the experimental setup are discussed in Section \ref{exp:scaling}.
  • Figure 2: Illustration of model merging methods. $\Theta_A$ is an irrelevant model, while $\Theta_B$ and $\Theta_C$ are relevant models. Darker regions indicate higher objective function loss. Task Arithmetic treats all task vectors equally and therefore fails to move in the optimal direction. AdaMerging assigns different coefficients and moves in a more desirable direction, but suffers from slow convergence due to interference from $\Theta_A$. FW-Merging iteratively selects the most relevant model to merge and adapts step sizes, efficiently reaching the optimum after $T$ iterations.
  • Figure 3: Linear Approximation of the Objective Function of Model Checkpoints Across Different Tasks in a Frank-Wolfe Iteration. The x-axis represents the checkpoints, and each graph shows the linear approximation result for each task.
  • Figure 4: Ablation on FW-Merging. (a) reports accuracies on the vision benchmark, while (b) reports accuracies on the vision and language benchmarks.
  • Figure 5: Performance vs. #Data Samples.

Theorems & Definitions (6)

  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Theorem 1: Convergence Rate of Soft FW
  • proof