Rethinking Layer-wise Model Merging through Chain of Merges

Pietro Buzzega, Riccardo Salami, Angelo Porrello, Simone Calderara

TL;DR

This work tackles the problem of merging multiple task-specific checkpoints without retraining by identifying a fundamental issue: merging covariate shift (MCS), caused by ignoring inter-layer dependencies when adjusting early-layer parameters. It proposes Chain of Merges (CoM), a recursive layer-wise merging framework that updates activation statistics after each merge, effectively turning parameter merging into a layer-wise distillation over progressively merged activations. By incorporating Gram-based, cosine-normalized activations and task importance weights, CoM achieves a closed-form, importance-weighted merging rule that mitigates distributional shifts and preserves downstream compatibility. Empirical results on vision and language benchmarks show CoM delivering state-of-the-art performance, robust across architectures and tasks, with strong data efficiency and practical applicability to diverse deployment scenarios.

Abstract

Fine-tuning pretrained models has become a standard pathway to achieve state-of-the-art performance across a wide range of domains, leading to a proliferation of task-specific model variants. As the number of such specialized models increases, merging them into a unified model without retraining has become a critical challenge. Existing merging techniques operate at the level of individual layers, thereby overlooking the inter-layer dependencies inherent in deep networks. We show that this simplification leads to distributional mismatches, particularly in methods that rely on intermediate activations, as changes in early layers are not properly propagated to downstream layers during merging. We identify these mismatches as a form of internal covariate shift, comparable to the phenomenon encountered in the initial phases of neural network training. To address this, we propose Chain of Merges (CoM), a layer-wise merging procedure that merges weights one layer at a time and updates the activation statistics after each merge. By explicitly accounting for inter-layer interactions, CoM mitigates covariate shift and produces a coherent merged model through a series of conditionally optimal updates. Experiments on standard benchmarks demonstrate that CoM achieves state-of-the-art performance.
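
The sequential idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy stack of bias-free linear layers with ReLU between them, uses a plain least-squares (Gram-matrix) solve per layer, and omits the cosine normalization mentioned in the TL;DR. The names `chain_merge`, `importance`, and `ridge` are hypothetical, and the importance weights are simply supplied by the caller. The point it shows is that each layer is fitted on inputs produced by the already-merged earlier layers, while the targets come from each task model's own forward pass.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def chain_merge(task_weights, task_inputs, importance=None, ridge=1e-6):
    """Merge toy MLPs layer by layer (illustrative sketch only).

    task_weights[t][l]: weight matrix of layer l in task model t, shape (d_out, d_in).
    task_inputs[t]:     calibration inputs for task t, shape (n_samples, d_0).
    importance[t]:      hypothetical per-task weight (defaults to 1.0 for every task).
    """
    num_tasks = len(task_weights)
    num_layers = len(task_weights[0])
    if importance is None:
        importance = [1.0] * num_tasks

    # Activations inside each task model ("teacher") and inside the merged model ("student").
    teacher = [x.copy() for x in task_inputs]
    student = [x.copy() for x in task_inputs]
    merged = []

    for l in range(num_layers):
        d_out, d_in = task_weights[0][l].shape
        gram = ridge * np.eye(d_in)      # small ridge term keeps the solve well-posed
        cross = np.zeros((d_in, d_out))
        targets = []
        for t in range(num_tasks):
            # Target: what task model t outputs at this layer on its own activations.
            target = teacher[t] @ task_weights[t][l].T
            # Statistics use the merged model's activations, not the task model's.
            gram += importance[t] * student[t].T @ student[t]
            cross += importance[t] * student[t].T @ target
            targets.append(target)

        # Weighted least squares: minimize sum_t a_t * ||student_t @ W.T - target_t||^2.
        w = np.linalg.solve(gram, cross).T
        merged.append(w)

        # Propagate both streams so the next layer sees up-to-date statistics.
        is_last = l == num_layers - 1
        for t in range(num_tasks):
            out = student[t] @ w.T
            teacher[t] = targets[t] if is_last else relu(targets[t])
            student[t] = out if is_last else relu(out)

    return merged
```

Replacing `student[t]` with `teacher[t]` in the two statistics lines would give an independent per-layer merge, in which later layers are fitted on activations that the merged model never actually produces. That gap is the mismatch the abstract attributes to ignoring inter-layer dependencies.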

Paper Structure

This paper contains 34 sections, 12 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Merging covariate shift across layers.
  • Figure 2: Task importance vs. similarity.
  • Figure 3: Merging covariate shift across layers.
  • Figure 4: Task importance vs. similarity.
  • Figure 5: Comparison of merging methods across computational and memory cost.
  • ...and 1 more figure