Scalable Model Merging with Progressive Layer-wise Distillation

Jing Xu, Jiazheng Li, Jingzhao Zhang

TL;DR

This work addresses the challenge of merging multiple fine-tuned models without extensive retraining. It proves that data-agnostic merging can perform poorly in worst-case scenarios, underscoring the need for domain-specific data. The authors introduce ProDistill, a progressive layer-wise distillation method that uses task vectors and activation matching to incrementally merge models with low memory overhead, enabling scaling to models beyond 10B parameters. Empirically, ProDistill achieves state-of-the-art gains across vision, NLU, and LLM tasks, while demonstrating strong data, computation, and memory efficiency. The approach offers a practical path to integrating diverse capabilities in large, deployed models with reduced training costs and energy consumption.

Abstract

Model merging offers an effective way to integrate the capabilities of multiple fine-tuned models. However, the performance degradation of the merged model remains a challenge, particularly when little or no data is available. This paper first highlights the necessity of domain-specific data for model merging by proving that data-agnostic algorithms can have arbitrarily bad worst-case performance. Building on this theoretical insight, we explore the relationship between model merging and distillation, introducing a novel few-shot merging algorithm, ProDistill (Progressive Layer-wise Distillation). Contrary to the common belief that layer-wise training hurts performance, we show that layer-wise teacher-student distillation not only enhances scalability but also improves model merging performance. We conduct extensive experiments to show that, compared to existing few-shot merging methods, ProDistill achieves state-of-the-art performance, with up to 6.14% and 6.61% improvements in vision and NLU tasks, respectively. Furthermore, we extend the experiments to models with over 10B parameters, showcasing the exceptional scalability of ProDistill.

Paper Structure

This paper contains 41 sections, 4 theorems, 28 equations, 13 figures, 10 tables, 1 algorithm.

Key Result

Theorem 3.1

There exist a task and loss function $\ell$, such that for any data-agnostic model merging algorithm $\mathcal{M}$, any pair of models $f_1\neq f_2$, and any $\varepsilon, C>0$, there exist two datasets $\mathcal{D}_1, \mathcal{D}_2$, such that $f_1, f_2$ have a near-zero loss on $\mathcal{D}_1$ and $\mathcal{D}_2$, but the merged model $\hat{f}=\mathcal{M}(f_1, f_2)$ has a constant loss on $\mathcal{D}_1\cup \mat

Figures (13)

  • Figure 1: ProDistill consistently outperforms other methods across nearly all considered tasks. The performance metrics for each task are normalized and then clipped at a minimum value of 0.5 for better visualization.
  • Figure 2: Left: Overview of model merging. Each expert corresponds to a task vector $\boldsymbol{\theta}_i-\boldsymbol{\theta}_0$, which is scaled by its corresponding merging coefficient $\boldsymbol{\lambda}_i$ and summed to get the merged model. Right: Illustration of ProDistill. The merged model layer and each fine-tuned model layer take as input the merged feature and the fine-tuned feature, respectively. The MSE loss between these outputs is used to update the merged model layer. The output features serve as inputs for merging the subsequent layer.
  • Figure 3: The t-SNE visualization of ViT-B-32 model trained by different merging algorithms, on the SVHN dataset. The features given by ProDistill are the most separated, resembling those of fine-tuned models.
  • Figure 4: Analysis of Data, Computation and Memory Efficiency. Left: The average accuracy of ProDistill and AdaMerging across 8 vision tasks, with different data availability. Our method demonstrates superior data efficiency. Middle: The average accuracy of ProDistill with different training epochs. Our algorithm converges quickly. Right: The training GPU memory cost of ProDistill, its unoptimized counterpart DistillMerge, and AdaMerging. Our method has a significantly smaller memory footprint.
  • Figure 5: Comparison between ProDistill and DistillMerge. Left: Accuracy results on 8 vision benchmarks using ViT-B-32. Right: Performance metrics on the NLU tasks using RoBERTa. The results demonstrate the performance improvement of progressive training in ProDistill, compared to end-to-end training in DistillMerge, despite the latter being more resource-intensive.
  • ...and 8 more figures
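
The procedure described in the Figure 2 caption can be sketched on a toy problem. The code below is a minimal illustration under strong assumptions, not the authors' implementation: each "layer" is an elementwise scaling rather than a transformer block, the merging coefficients are elementwise and initialized to 0.3, and they are trained with plain gradient descent. All names (`distill_layer`, `prodistill_toy`, and so on) are hypothetical. It shows the two ingredients from the caption: the merged weights are the pretrained weights plus task vectors scaled by merging coefficients, and each layer is trained on the MSE between the merged layer's output on the merged features and each fine-tuned layer's output on its own features, after which both sets of features are passed on to the next layer.

```python
import random

def layer_forward(weights, x):
    # Toy "layer": elementwise scaling (an assumption; real layers are network blocks).
    return [w * v for w, v in zip(weights, x)]

def merged_weights(theta0, task_vectors, lambdas):
    # theta_merged = theta_0 + sum_i lambda_i * (theta_i - theta_0), elementwise.
    return [t0 + sum(lam[j] * tv[j] for lam, tv in zip(lambdas, task_vectors))
            for j, t0 in enumerate(theta0)]

def distill_loss(weights, merged_feats, targets):
    # Mean over samples of the squared error between merged and expert outputs.
    total, count = 0.0, 0
    for feats_i, targets_i in zip(merged_feats, targets):
        for x, t in zip(feats_i, targets_i):
            out = layer_forward(weights, x)
            total += sum((o - tj) ** 2 for o, tj in zip(out, t))
            count += 1
    return total / count

def distill_layer(theta0, experts, merged_feats, expert_feats,
                  lr=0.1, steps=200, init=0.3):
    # Task vectors for this layer, one per fine-tuned expert.
    task_vectors = [[e - t0 for e, t0 in zip(exp, theta0)] for exp in experts]
    lambdas = [[init] * len(theta0) for _ in experts]
    # Targets: each expert layer applied to its own fine-tuned features.
    targets = [[layer_forward(exp, x) for x in feats]
               for exp, feats in zip(experts, expert_feats)]
    n = sum(len(f) for f in merged_feats)
    before = distill_loss(merged_weights(theta0, task_vectors, lambdas),
                          merged_feats, targets)
    for _ in range(steps):
        w = merged_weights(theta0, task_vectors, lambdas)
        grad_w = [0.0] * len(w)
        for feats_i, targets_i in zip(merged_feats, targets):
            for x, t in zip(feats_i, targets_i):
                for j in range(len(w)):
                    grad_w[j] += 2.0 * (w[j] * x[j] - t[j]) * x[j] / n
        # Chain rule: d(loss)/d(lambda_i[j]) = d(loss)/d(w[j]) * task_vector_i[j].
        for lam, tv in zip(lambdas, task_vectors):
            for j in range(len(lam)):
                lam[j] -= lr * grad_w[j] * tv[j]
    w = merged_weights(theta0, task_vectors, lambdas)
    return w, targets, before, distill_loss(w, merged_feats, targets)

def prodistill_toy(pretrained, experts, task_inputs):
    # Both feature streams start from the same few-shot inputs of each task.
    merged_feats = [list(xs) for xs in task_inputs]
    expert_feats = [list(xs) for xs in task_inputs]
    merged_model, history = [], []
    for l, theta0 in enumerate(pretrained):
        w, targets, before, after = distill_layer(
            theta0, [exp[l] for exp in experts], merged_feats, expert_feats)
        # Output features become the inputs for merging the next layer.
        merged_feats = [[layer_forward(w, x) for x in feats] for feats in merged_feats]
        expert_feats = targets
        merged_model.append(w)
        history.append((before, after))
    return merged_model, history

rng = random.Random(0)
DIM, LAYERS, TASKS, SAMPLES = 3, 2, 2, 8
pretrained = [[rng.uniform(0.5, 1.5) for _ in range(DIM)] for _ in range(LAYERS)]
experts = [[[w + rng.uniform(-0.5, 0.5) for w in layer] for layer in pretrained]
           for _ in range(TASKS)]
task_inputs = [[[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(SAMPLES)]
               for _ in range(TASKS)]

merged_model, history = prodistill_toy(pretrained, experts, task_inputs)
for l, (before, after) in enumerate(history):
    print(f"layer {l}: distillation loss {before:.4f} -> {after:.4f}")
```

Because only one layer and its input features are live at any time, a loop of this shape never needs the full set of models and their activations in memory at once, which is consistent with the memory comparison summarized in the Figure 4 caption.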

Theorems & Definitions (8)

  • Theorem 3.1
  • Theorem 3.2
  • Remark 3.3
  • Theorem 1.1
  • proof
  • Theorem 1.1
  • proof
  • Remark 1.1