Non-Uniform Parameter-Wise Model Merging

Albert Manuel Orozco Camacho, Stefan Horoi, Guy Wolf, Eugene Belilovsky

TL;DR

This paper tackles merging neural networks trained from different initializations by moving beyond uniform parameter averaging. It introduces NP Merge, which, after aligning the two models, uses gradient descent to learn parameter-wise interpolation weights $\boldsymbol{\alpha}_i$ and forms the merged weights $W_i = \boldsymbol{\alpha}_i \odot W^{A}_i + (\mathbf{1}-\boldsymbol{\alpha}_i)\odot W^{B'}_i$, where the prime marks model B's weights after alignment. The parameterization $\boldsymbol{\alpha}_i = \sigma(\boldsymbol{\alpha}_i^{pre})$ keeps every entry of $\boldsymbol{\alpha}_i$ strictly between 0 and 1. The method extends to multiple models via successive pairwise merges and demonstrates strong performance across CIFAR-10/100 and ImageNet-200 in both same- and cross-distribution settings, often outperforming baselines and approaching ensemble accuracy. NP Merge incurs extra memory and computation because of the gradient updates on $\boldsymbol{\alpha}$, but the gains in accuracy and robustness, especially in limited-data or federated-like scenarios, justify the overhead. The learned coefficients could also be connected to Fisher-information priors for further improvements.
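The interpolation rule above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the paper optimizes the coefficients against a task loss on data through an aligned network, whereas this toy replaces that loss with a quadratic distance to a hypothetical `target` vector so the gradient can be written by hand. The names `merge`, `learn_alpha_pre`, `w_b_aligned`, and `target`, as well as the learning rate and step count, are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def merge(alpha_pre, w_a, w_b_aligned):
    """Elementwise W = alpha * W_A + (1 - alpha) * W_B', alpha = sigmoid(alpha_pre)."""
    return [
        sigmoid(p) * a + (1.0 - sigmoid(p)) * b
        for p, a, b in zip(alpha_pre, w_a, w_b_aligned)
    ]


def learn_alpha_pre(w_a, w_b_aligned, target, lr=1.0, steps=2000):
    """Gradient descent on alpha_pre for the toy loss 0.5 * sum((W - target)^2).

    Assumption: the quadratic loss is a stand-in for the task loss used in the paper.
    """
    # sigmoid(0) = 0.5, so optimization starts from uniform averaging.
    alpha_pre = [0.0] * len(w_a)
    for _ in range(steps):
        merged = merge(alpha_pre, w_a, w_b_aligned)
        for i, (w, a, b, t) in enumerate(zip(merged, w_a, w_b_aligned, target)):
            s = sigmoid(alpha_pre[i])
            # Chain rule: dL/dW * dW/dalpha * dalpha/dalpha_pre.
            grad = (w - t) * (a - b) * s * (1.0 - s)
            alpha_pre[i] -= lr * grad
    return alpha_pre


# Toy "already aligned" weights and a hypothetical target for the merged weights.
w_a = [1.0, 0.0, 2.0]
w_b_aligned = [0.0, 1.0, 0.0]
target = [0.9, 0.8, 1.5]

alpha_pre = learn_alpha_pre(w_a, w_b_aligned, target)
alphas = [sigmoid(p) for p in alpha_pre]
merged = merge(alpha_pre, w_a, w_b_aligned)
print("alphas:", alphas)
print("merged:", merged)
```

In this toy the interpolation weights that reproduce `target` exactly are 0.9, 0.2, and 0.75, so each parameter ends up with a different coefficient, which a single uniform averaging weight cannot express. Because the coefficients pass through a sigmoid, they stay strictly inside $(0, 1)$ throughout optimization without any explicit constraint.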

Abstract

Combining multiple machine learning models has long been a technique for enhancing performance, particularly in distributed settings. Traditional approaches, such as model ensembles, work well, but are expensive in terms of memory and compute. Recently, methods based on averaging model parameters have achieved good results in some settings and have gained popularity. However, merging models initialized differently that do not share a part of their training trajectories can yield worse results than simply using the base models, even after aligning their neurons. In this paper, we introduce a novel approach, Non-uniform Parameter-wise Model Merging, or NP Merge, which merges models by learning the contribution of each parameter to the final model using gradient-based optimization. We empirically demonstrate the effectiveness of our method for merging models of various architectures in multiple settings, outperforming past methods. We also extend NP Merge to handle the merging of multiple models, showcasing its scalability and robustness.

Paper Structure

This paper contains 18 sections, 6 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: We plot the number of merged models against the merged model's accuracy. The top plot shows the results for the balanced data setting, and the bottom plot shows the results for the unbalanced data setting. This setting corresponds to ResNet20$\times$8 trained on CIFAR-100. The $\alpha$ parameters were optimized as described in the experimental details section.