If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs

Muhammad Khalifa, Yi-Chern Tan, Arash Ahmadian, Tom Hosking, Honglak Lee, Lu Wang, Ahmet Üstün, Tom Sherborne, Matthias Gallé

TL;DR

The paper tackles the challenge of balancing capabilities across many tasks in very large language models by recycling suboptimal checkpoints through training-free linear merging. It introduces a CMA-ES-based optimization to find the weights for a linear model soup that maximizes a macro-average fitness across tasks, formalized as $\theta_{\text{mrg}} = \sum_i \alpha_i \theta_i$ with $\sum_i \alpha_i = 1$. Experiments with 16 checkpoints across two- and three-task settings show that the optimized merges achieve Pareto-optimal tradeoffs and often outperform both individual checkpoints and simple baselines, while revealing that most checkpoints contribute to the final model. This approach offers a scalable, cost-efficient method to recycle imperfect checkpoints in frontier-model workflows, enabling training-free optimization of task tradeoffs at scale.
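To make the merge formula concrete, here is a minimal sketch of a linear model soup in plain Python. It is an illustration under assumptions, not the authors' implementation: the `Checkpoint` type, the parameter names, and the toy values are hypothetical, and rescaling the weights so they sum to 1 is one simple way to satisfy the constraint $\sum_i \alpha_i = 1$.

```python
from typing import Dict, List, Sequence

# Hypothetical representation: parameter name -> flat list of values.
Checkpoint = Dict[str, List[float]]


def normalize(alphas: Sequence[float]) -> List[float]:
    """Rescale weights so they sum to 1 (assumed way to enforce the constraint)."""
    total = sum(alphas)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return [a / total for a in alphas]


def linear_merge(checkpoints: Sequence[Checkpoint], alphas: Sequence[float]) -> Checkpoint:
    """Return theta_mrg = sum_i alpha_i * theta_i, parameter by parameter."""
    if len(checkpoints) != len(alphas):
        raise ValueError("need exactly one weight per checkpoint")
    weights = normalize(alphas)
    merged: Checkpoint = {}
    for name in checkpoints[0]:
        size = len(checkpoints[0][name])
        merged[name] = [
            sum(w * ckpt[name][j] for w, ckpt in zip(weights, checkpoints))
            for j in range(size)
        ]
    return merged


# Toy checkpoints with made-up values.
theta_1 = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
theta_2 = {"layer.weight": [3.0, 6.0], "layer.bias": [4.0]}

# Raw weights [1, 3] are rescaled to [0.25, 0.75] before merging.
merged = linear_merge([theta_1, theta_2], alphas=[1.0, 3.0])
print(merged)
```

Because the merge is a weighted sum of existing parameters, it needs no gradient updates, which is what makes the approach training-free.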

Abstract

Model merging has shown great promise at combining expert models, but the benefit of merging is unclear when merging "generalist" models trained on many tasks. We explore merging in the context of large (~100B) models, by recycling checkpoints that exhibit tradeoffs among different tasks. Such checkpoints are often created in the process of developing a frontier model, and the suboptimal ones are usually discarded. Given a pool of model checkpoints obtained from different training runs (e.g., different stages, objectives, hyperparameters, and data mixtures), which naturally show tradeoffs across different language capabilities (e.g., instruction following vs. code generation), we investigate whether merging can recycle such suboptimal models into a Pareto-optimal one. Our optimization algorithm tunes the weight of each checkpoint in a linear combination, resulting in such an optimal model that outperforms both individual models and merge-based baselines. Further analysis shows that good merges tend to include almost all checkpoints with non-zero weights, indicating that even seemingly bad initial checkpoints can contribute to good final merges.
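The weight tuning described above can be sketched as a black-box search over the merge weights. The example below is a deliberately simplified stand-in: it uses a basic Gaussian evolution strategy with elite averaging, not CMA-ES, and the evaluator is a toy function rather than real benchmark runs. All names (`search_weights`, `toy_evaluate`, `CHECKPOINTS`, `TASK_OPTIMA`) and all numbers are hypothetical. The sum-to-one constraint is enforced by searching over all but one weight and deriving the last one.

```python
import random
from typing import Callable, List, Sequence, Tuple


def macro_average(task_scores: Sequence[float]) -> float:
    """Fitness: unweighted mean of the per-task scores."""
    return sum(task_scores) / len(task_scores)


def full_weights(free: Sequence[float]) -> List[float]:
    """Append the last weight so that all weights sum to 1."""
    return list(free) + [1.0 - sum(free)]


def search_weights(
    evaluate: Callable[[Sequence[float]], Sequence[float]],
    n_checkpoints: int,
    generations: int = 60,
    population: int = 16,
    elite: int = 4,
    sigma: float = 0.3,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Simple evolution strategy (NOT CMA-ES): sample, rank, average the elites."""
    rng = random.Random(seed)
    dim = n_checkpoints - 1
    mean = [1.0 / n_checkpoints] * dim  # start from the uniform soup
    best_alphas = full_weights(mean)
    best_fit = macro_average(evaluate(best_alphas))
    for _ in range(generations):
        scored = []
        for _ in range(population):
            cand = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
            alphas = full_weights(cand)
            fit = macro_average(evaluate(alphas))
            scored.append((fit, cand))
            if fit > best_fit:
                best_fit, best_alphas = fit, alphas
        scored.sort(key=lambda pair: pair[0], reverse=True)
        top = [cand for _, cand in scored[:elite]]
        mean = [sum(c[d] for c in top) / elite for d in range(dim)]
        sigma *= 0.95  # shrink the step size over time
    return best_alphas, best_fit


# Toy setup: three 2-D "checkpoints" and two tasks, each with its own optimum.
CHECKPOINTS = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
TASK_OPTIMA = [(1.0, 0.5), (0.5, 1.0)]


def toy_evaluate(alphas: Sequence[float]) -> List[float]:
    """Score the merged point on each task as negative squared distance."""
    x = sum(a * c[0] for a, c in zip(alphas, CHECKPOINTS))
    y = sum(a * c[1] for a, c in zip(alphas, CHECKPOINTS))
    return [-((x - tx) ** 2 + (y - ty) ** 2) for tx, ty in TASK_OPTIMA]


uniform_fit = macro_average(toy_evaluate([1 / 3, 1 / 3, 1 / 3]))
best_alphas, best_fit = search_weights(toy_evaluate, n_checkpoints=3)
print(best_alphas, best_fit, uniform_fit)
```

In a real setting, each call to the evaluator would mean merging the checkpoints and running the held-in benchmarks, so the number of fitness evaluations is the main cost. A dedicated CMA-ES implementation would also adapt the full covariance of the sampling distribution, which this sketch omits.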

Paper Structure

This paper contains 29 sections, 5 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: An overview of our setup. Given models obtained from different LLM training runs, we optimize linear merging weightings ($\alpha_1, \alpha_2, \alpha_3$) via iterative search to obtain a model with minimal task tradeoffs. Each icon represents a single model, shown with an indicator of its performance on the two tasks. The purple color indicates a Pareto-optimal model, which achieves a good balance between the two tasks without being dominated by other models. We show tradeoffs between only two tasks because they are easier to visualize.
  • Figure 2: Performance of individual models over the seven tasks covering different capabilities. Models 1-8 are the result of supervised finetuning runs, while models 8-16 come from preference optimization. Held-out tasks (MT-Bench and LBPP) are used to evaluate the resulting merges, to ensure that the merge optimization process does not overfit to the held-in tasks whose tradeoffs we aim to minimize. The MT-Bench rating is scaled by a factor of 10 for better visualization. The exact numbers are in the table (tab:combined-info) in the appendix (app:ckpt-info).
  • Figure 3: Performance tradeoffs with different merging approaches over different pairwise combinations. Shaded areas represent 95% confidence interval of the best-fit line computed over individual checkpoint scores (shown in green).
  • Figure 4: Spearman's rank correlation between task pairs. Some task pairs exhibit strong performance tradeoffs, such as MBPP/IFEval and MMLU-Pro/MUSR.
  • Figure 5: Performance of different merge approaches when minimizing the tradeoffs across three tasks: MBPP, IFEval, and GSM8K. Dashed red lines represent the best individual model on the corresponding task. Search-optimized merging balances performance well across the three tasks. Bars corresponding to merges are hatched to differentiate them from individual models.
  • ...and 4 more figures
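
The Figure 1 caption describes a Pareto-optimal model as one that is not dominated by other models. A minimal sketch of that check follows, assuming higher scores are better on every task. The model names and scores are made up for illustration and are not results from the paper.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is at least as good on every task and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of all models that no other model dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(o, s) for other, o in scores.items() if other != name)
    ]


# Hypothetical (task 1, task 2) scores for three checkpoints and one merge.
scores = {
    "ckpt_a": (70.0, 40.0),
    "ckpt_b": (50.0, 60.0),
    "ckpt_c": (45.0, 55.0),
    "merge": (68.0, 62.0),
}

front = pareto_front(scores)
print(front)  # ['ckpt_a', 'merge']
```

In this toy example the merge dominates `ckpt_b` and `ckpt_c`, while `ckpt_a` stays on the front because it is still the best model on the first task.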