How does the optimizer implicitly bias the model merging loss landscape?

Chenxiang Zhang, Alexander Theus, Damien Teney, Antonio Orvieto, Jun Pang, Sjouke Mauw

TL;DR

This work investigates why independently trained models merge poorly or well by examining how optimization dynamics sculpt the merging loss landscape. It introduces the effective noise scale as a unifying factor that aggregates learning rate, batch size, momentum, and data augmentation, and shows a non-monotonic relationship between this noise and merging gains for both linear interpolation and task arithmetic. Across architectures, datasets, and transfer scenarios, larger learning rates and weight decays can enhance merging up to a sweet spot, while smaller batch sizes and augmentation further promote compatibility, with initialization playing a crucial role in task arithmetic. The findings illuminate how optimization shapes cross-solution compatibility, offering a pathway to engineer training dynamics that yield more mergeable models and informing future study of loss-landscape geometry in multi-solution fusion.

Abstract

Model merging methods combine models with different capabilities into a single one while maintaining the same inference cost. Two popular approaches are linear interpolation, which linearly interpolates between model weights, and task arithmetic, which combines task vectors obtained as the difference between finetuned and base models. While merging is useful in practice, the properties that make it effective are poorly understood. This paper explores how the optimization process affects the loss landscape geometry and its impact on merging success. We show that a single quantity, the effective noise scale, unifies the impact of optimizer and data choices on model merging. Across architectures and datasets, merging effectiveness is a non-monotonic function of effective noise, with a distinct optimum. Decomposing this quantity, we find that larger learning rates, stronger weight decay, smaller batch sizes, and data augmentation all independently modulate the effective noise scale, exhibiting the same qualitative trend. Unlike prior work that connects optimizer noise to the flatness or generalization of individual minima, we show that it also affects the global loss landscape, predicting when independently trained solutions can be merged. Our findings broaden the understanding of how optimization shapes the loss landscape geometry and its downstream consequences for model merging, suggesting the possibility of further manipulating the training dynamics to improve merging effectiveness.
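The two merging approaches named in the abstract can be illustrated with a minimal sketch. It assumes each model is a plain dictionary mapping parameter names to lists of floats; the function names, the `alpha` and `scale` parameters, and the toy values are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List

# Assumption: a "model" is a dict of parameter name -> flat list of floats.
Params = Dict[str, List[float]]


def linear_interpolation(theta_a: Params, theta_b: Params, alpha: float = 0.5) -> Params:
    """Return (1 - alpha) * theta_a + alpha * theta_b, parameter by parameter."""
    return {
        name: [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a[name], theta_b[name])]
        for name in theta_a
    }


def task_vector(theta_finetuned: Params, theta_base: Params) -> Params:
    """Task vector: finetuned weights minus base weights."""
    return {
        name: [f - b for f, b in zip(theta_finetuned[name], theta_base[name])]
        for name in theta_base
    }


def task_arithmetic(theta_base: Params, task_vectors: List[Params], scale: float = 1.0) -> Params:
    """Add the scaled sum of task vectors to the base weights."""
    merged = {name: list(values) for name, values in theta_base.items()}  # copy the base
    for vector in task_vectors:
        for name in merged:
            merged[name] = [m + scale * v for m, v in zip(merged[name], vector[name])]
    return merged


# Toy example with a single two-dimensional parameter "w".
base = {"w": [0.0, 0.0]}
finetuned_a = {"w": [1.0, 0.0]}
finetuned_b = {"w": [0.0, 2.0]}

midpoint = linear_interpolation(finetuned_a, finetuned_b, alpha=0.5)
merged = task_arithmetic(
    base,
    [task_vector(finetuned_a, base), task_vector(finetuned_b, base)],
    scale=1.0,
)
```

In this toy case the interpolated midpoint lies halfway between the two finetuned models, while task arithmetic adds both displacements to the shared base. Whether either merged point performs well depends on the loss landscape between the solutions, which is the question the paper studies.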

Paper Structure

This paper contains 35 sections, 4 equations, 29 figures.

Figures (29)

  • Figure 1: Effective noise scale controls the effectiveness of merging. The y-axis reports the test accuracy gain of merged models. On the x-axis, when plotting (a) batch sizes against learning rates or (b) vice versa, there is no clear trend. When reparameterized in terms of (c) effective noise scale, the curves are aligned, highlighting the interaction between different components for merging.
  • Figure 2: Larger learning rate leads to more effective merging. (top) The test accuracy gain of all the models. (bottom) Each point represents a single model's accuracy on the $x$-axis and its accuracy gain after merging on the $y$-axis. The opacity indicates the number of training epochs. For each setup, we observe that larger learning rates yield a higher accuracy gain, even when a smaller learning rate reaches an equivalent single-model accuracy. Note, however, that solutions found using a "too large" learning rate fail to merge (details in \ref{app:fail}).
  • Figure 3: Weight decay has a similar effect as the learning rate. For CIFAR100 and TinyImagenet, we use scale-invariant networks (w/ normalization layers) and observe that a larger weight decay can improve not only the accuracy of the single model but also the accuracy gain, via the effective learning rate [van2017l2]. For the MLP trained on SVHN, there is no trend, as the architecture is not scale-invariant.
  • Figure 4: Batch size
  • Figure 5: Data augmentation
  • ...and 24 more figures
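
The reparameterization described in the Figure 1 caption can be illustrated with a short sketch. This summary does not state the paper's exact definition of the effective noise scale, so the code below assumes a commonly used SGD-with-momentum approximation, learning rate divided by batch size times one minus momentum. The function name and the example hyperparameters are hypothetical, and the data augmentation component mentioned in the TL;DR is not modeled.

```python
import math


def effective_noise_scale(learning_rate: float, batch_size: int, momentum: float = 0.0) -> float:
    """Assumed approximation: lr / (batch_size * (1 - momentum)).

    This is a standard heuristic for SGD with momentum, used here only for
    illustration; it may differ from the definition used in the paper.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if not 0.0 <= momentum < 1.0:
        raise ValueError("momentum must be in [0, 1)")
    return learning_rate / (batch_size * (1.0 - momentum))


# Two hypothetical runs that differ in learning rate and batch size
# but share the same assumed noise scale.
run_small = effective_noise_scale(learning_rate=0.1, batch_size=128)
run_large = effective_noise_scale(learning_rate=0.2, batch_size=256)
same_noise = math.isclose(run_small, run_large)

# Adding momentum 0.9 raises the assumed noise scale by roughly a factor of 10.
run_momentum = effective_noise_scale(learning_rate=0.1, batch_size=128, momentum=0.9)
```

Under this assumption, runs plotted against learning rate or batch size alone would show no clear trend, while runs with the same ratio would fall on the same point of the x-axis, which is consistent with the alignment described for panel (c).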