MAP: Low-compute Model Merging with Amortized Pareto Fronts via Quadratic Approximation

Lu Li, Tianyu Zhang, Zhiqi Bu, Suyuchen Wang, Huan He, Jie Fu, Yonghui Wu, Jiang Bian, Yong Chen, Yoshua Bengio

TL;DR

MAP offers a low-cost framework for multi-task merging by replacing expensive evaluations with a quadratic surrogate per task: $\tilde{M}_n(\boldsymbol{c}) = \frac{1}{2}\boldsymbol{c}^\top\boldsymbol{A}_n\boldsymbol{c} + \boldsymbol{b}_n^\top\boldsymbol{c} + e_n$, where $\boldsymbol{A}_n = \boldsymbol{V}^\top\boldsymbol{H}_n(\bm{\theta}_{\text{pre}})\boldsymbol{V}$ and $\boldsymbol{b}_n = \boldsymbol{V}^\top\nabla M_n(\bm{\theta}_{\text{pre}})$. By evaluating a small set of scaling vectors $\boldsymbol{c}$, MAP fits these surrogates and then uses a MOOP algorithm such as NSGA-III to approximate amortized Pareto fronts without gradient-based retraining. To handle many tasks and limited resources, the paper introduces Nested MAP (NMMAP) with $O(N\log N)$ evaluations and Bayesian MAP (BMAP) with adaptive sampling. The method is validated across vision and language tasks (ViT/CLIP, ResNet, Llama) and is shown to produce diverse, well-distributed Pareto fronts that can outperform direct search and plug into other task-vector merging methods. The authors provide code and demonstrate practical impact for deploying trade-off-aware, private-data merging at scale, enabling flexible, user-preferred balancing of task objectives with minimal computational overhead.
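
As a hedged sketch of where the surrogate coefficients come from, assume the merged parameters take the form $\bm{\theta}_{\text{pre}} + \boldsymbol{V}\boldsymbol{c}$, where the columns of $\boldsymbol{V}$ are the task vectors. This parameterization is an assumption made here, inferred from the definitions of $\boldsymbol{A}_n$ and $\boldsymbol{b}_n$. A second-order Taylor expansion of the task metric $M_n$ around $\bm{\theta}_{\text{pre}}$ then gives

$$
\begin{aligned}
M_n(\bm{\theta}_{\text{pre}} + \boldsymbol{V}\boldsymbol{c})
&\approx M_n(\bm{\theta}_{\text{pre}})
 + \nabla M_n(\bm{\theta}_{\text{pre}})^\top \boldsymbol{V}\boldsymbol{c}
 + \frac{1}{2}(\boldsymbol{V}\boldsymbol{c})^\top \boldsymbol{H}_n(\bm{\theta}_{\text{pre}})\,\boldsymbol{V}\boldsymbol{c} \\
&= e_n + \boldsymbol{b}_n^\top \boldsymbol{c} + \frac{1}{2}\boldsymbol{c}^\top \boldsymbol{A}_n \boldsymbol{c},
\end{aligned}
$$

with $\boldsymbol{b}_n = \boldsymbol{V}^\top\nabla M_n(\bm{\theta}_{\text{pre}})$, $\boldsymbol{A}_n = \boldsymbol{V}^\top\boldsymbol{H}_n(\bm{\theta}_{\text{pre}})\boldsymbol{V}$, and $e_n$ playing the role of the constant term $M_n(\bm{\theta}_{\text{pre}})$. Because $\boldsymbol{A}_n$ is a symmetric $N \times N$ matrix and $\boldsymbol{b}_n$ has $N$ entries, each surrogate has $(N+1)(N+2)/2$ free parameters, a count that depends on the number of tasks rather than on the number of model weights.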

Abstract

Model merging has emerged as an effective approach to combine multiple single-task models into a multitask model. This process typically involves computing a weighted average of the model parameters without any additional training. Existing model-merging methods focus on enhancing average task accuracy. However, interference and conflicts between the objectives of different tasks can lead to trade-offs during the merging process. In real-world applications, a set of solutions with various trade-offs can be more informative, helping practitioners make decisions based on diverse preferences. In this paper, we introduce a novel and low-compute algorithm, Model Merging with Amortized Pareto Front (MAP). MAP efficiently identifies a Pareto set of scaling coefficients for merging multiple models, reflecting the trade-offs involved. It amortizes the substantial computational cost of evaluations needed to estimate the Pareto front by using quadratic approximation surrogate models derived from a pre-selected set of scaling coefficients. Experimental results on vision and natural language processing tasks demonstrate that MAP can accurately identify the Pareto front, providing practitioners with flexible solutions to balance competing task objectives. We also introduce Bayesian MAP for scenarios with a relatively low number of tasks and Nested MAP for situations with a high number of tasks, further reducing the computational cost of evaluation.
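
To make the role of the scaling coefficients concrete, here is a minimal sketch of merging in task-vector form, where each fine-tuned model contributes its difference from the pre-trained weights scaled by one coefficient. The function name `merge_with_task_vectors` and the toy three-parameter "models" are hypothetical and chosen only for illustration; real models would apply the same arithmetic per weight tensor.

```python
import numpy as np

def merge_with_task_vectors(theta_pre, theta_finetuned, c):
    """Return theta_pre + sum_n c[n] * (theta_finetuned[n] - theta_pre).

    theta_pre: flat parameter vector of the pre-trained model, shape (d,)
    theta_finetuned: list of N fine-tuned parameter vectors, each shape (d,)
    c: N scaling coefficients, one per task
    """
    theta_pre = np.asarray(theta_pre, dtype=float)
    # Task vector n is the displacement from the pre-trained weights.
    task_vectors = np.stack(
        [np.asarray(t, dtype=float) - theta_pre for t in theta_finetuned]
    )  # shape (N, d)
    c = np.asarray(c, dtype=float)
    return theta_pre + c @ task_vectors

# Toy example with d = 3 parameters and N = 2 tasks (made-up numbers).
theta_pre = np.array([0.0, 1.0, 2.0])
theta_ft = [np.array([1.0, 1.0, 2.0]), np.array([0.0, 3.0, 2.0])]
merged = merge_with_task_vectors(theta_pre, theta_ft, c=[0.5, 0.25])
```

When the coefficients sum to 1, this form reduces to a plain weighted average of the fine-tuned models, which is the case the abstract describes; other choices of `c` move along different trade-offs between the tasks.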

Paper Structure (65 sections, 3 theorems, 22 equations, 15 figures, 9 tables, 3 algorithms)

Key Result

Corollary 1

Under Assumption (ass:taylor), for each task $n = 1, \dots, N$, the optimization problem (eq:Ab optimal) is equivalent to solving a linear regression where the predictors include all quadratic, interaction, linear, and constant terms of $\mathbf{c}$. The closed-form solution for the parameters is given where $\mathbf{C}_n(\mathbf{c}) = (c_1^2, c_2^2, \ldots, c_N^2, c_1 c_2, c_1 c_3, \ldots, c_{N-1} c
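
The regression view in the corollary can be sketched with ordinary least squares. The code below is an illustration, not the authors' implementation: the helper names (`quadratic_features`, `fit_surrogate`, `predict`) are hypothetical, the ordering of the linear and constant features after the interaction terms is an assumption (the feature vector is cut off in the text above), and the "metric" is a synthetic quadratic rather than a real model evaluation.

```python
import itertools
import numpy as np

def quadratic_features(c):
    """Map c of shape (N,) to [c_i^2 ..., c_i*c_j for i<j ..., c_i ..., 1]."""
    c = np.asarray(c, dtype=float)
    squares = c ** 2
    cross = [c[i] * c[j] for i, j in itertools.combinations(range(len(c)), 2)]
    return np.concatenate([squares, cross, c, [1.0]])

def fit_surrogate(C, y):
    """Least-squares fit of one task's metric y (K,) on sampled coefficients C (K, N)."""
    X = np.stack([quadratic_features(c) for c in C])  # design matrix, (K, P)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict(w, c):
    """Surrogate prediction of the metric at scaling coefficients c."""
    return float(quadratic_features(c) @ w)

# Synthetic stand-in for "evaluate the merged model at c" (assumed, N = 2 tasks).
def true_metric(c):
    return 0.6 + 0.5 * c[0] + 0.2 * c[1] - 0.4 * c[0] ** 2 - 0.3 * c[0] * c[1]

rng = np.random.default_rng(0)
C = rng.uniform(0.0, 1.0, size=(20, 2))       # 20 sampled scaling vectors
y = np.array([true_metric(c) for c in C])     # their (synthetic) metric values
w = fit_surrogate(C, y)
```

Under this parameterization the fitted weight on $c_i^2$ corresponds to half of the $i$-th diagonal entry of $\mathbf{A}_n$, the weight on $c_i c_j$ to the off-diagonal entry $(i, j)$, the linear weights to $\mathbf{b}_n$, and the constant to $e_n$. One such regression is fitted per task, reusing the same sampled coefficients.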

Figures (15)

  • Figure 1: Illustration of the overall process of MAP for the case of two tasks. Step 1: Select 2 tasks and compute their corresponding task vectors. Step 2: Sample a few scaling coefficients $\mathbf{c}$ and query the evaluation metrics for each task, respectively. Step 3: Use the quadratic model as a surrogate model to approximate the mapping $\mathbf{c} \rightarrow \textit{metrics}$. Step 4: Use the NSGA-III algorithm with the surrogate objective functions to find amortized Pareto fronts. (a): Contour plot of the actual accuracy landscape for the ViT-B/32 model [dosovitskiy2020image] obtained from 100 scaling coefficients (sampled uniformly) evaluated on the SUN397 [sun397] and Cars [cars] datasets. (b): Contour plot of the fitted quadratic functions. Red lines represent the Pareto front in the decision variable $(c_1, c_2)$ space. (c): Example of the resulting Pareto fronts. The Pareto front (Grid search) is regarded as the ground truth given the sufficient number of grid points evaluated. The Pareto front (MAP, predicted) is the amortized Pareto front. The Pareto front (MAP, real) corresponds to the same $\{(c_1,c_2)\}$ but is re-evaluated to obtain the ground truth metrics for comparison. The yellow lines indicate the evaluated performance of the fine-tuned single-task models.
  • Figure 2: Left: Comparison of our method with direct search for merged ViT-B/32 models [dosovitskiy2020image], based on evaluation results of 250 combinations of scaling coefficients. Our method identifies a more diverse set of solutions across eight tasks within the same computational budget. Both methods aim to maximize the performance of one task while ensuring that all other tasks meet a minimum threshold of $40\%$. The bar plot displays the maximized accuracy for each task. Right: When the threshold is increased to $65\%$ of the single-task model's performance, the brute-force direct search method fails to find any feasible solutions within the same computational budget.
  • Figure 3: Density plot of the absolute values of the weight matrices of the 8 task vectors.
  • Figure 4: (a): Utilize Bayesian optimization to guide the sampling of scaling coefficients according to uncertainty distribution; (b): An example of nested model merging for $N=8$ models.
  • Figure 5: The Pareto fronts obtained using MAP with Task Arithmetic and using MAP with Task Arithmetic combined with DARE. We sampled 10 Pareto solutions from the front predicted by MAP and evaluated them to obtain the real values. For comparison, we also plotted the results obtained using TIES-merging, Task Arithmetic (TA) with a single scalar for all tasks, Task Arithmetic with preferences as scalars, TA combined with DARE (DARE-TA), TIES-merging combined with DARE (DARE-TIES), and SLERP.
  • ...and 10 more figures

Theorems & Definitions (5)

  • Definition 1: Pareto dominance
  • Definition 2: Pareto optimal solutions
  • Corollary 1: Closed-form Solution for Surrogate Model Parameters
  • Corollary 2
  • Corollary 3: Closed-form Solution for Surrogate Model Parameters
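
The two definitions listed above can be illustrated with a short sketch. It assumes the usual convention for metrics that are maximized (such as accuracy): one solution dominates another if it is at least as good on every task and strictly better on at least one. The function names and the score table are hypothetical, and the quadratic-time filter below is a simple stand-in for illustration, not the NSGA-III search used in the paper.

```python
import numpy as np

def dominates(u, v):
    """True if u Pareto-dominates v when every objective is maximized."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return bool(np.all(u >= v) and np.any(u > v))

def pareto_front(points):
    """Indices of the non-dominated rows of points, shape (K, N)."""
    points = np.asarray(points, dtype=float)
    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]

# Made-up per-task scores for five candidate scaling vectors (two tasks).
scores = np.array([
    [0.70, 0.50],
    [0.60, 0.65],
    [0.55, 0.60],   # dominated by the second row
    [0.50, 0.70],
    [0.70, 0.45],   # dominated by the first row
])
front = pareto_front(scores)
```

Applied to surrogate predictions over many candidate scaling vectors, the surviving indices give a rough picture of a Pareto set; an evolutionary method such as NSGA-III aims for the same set while also keeping the solutions well spread.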