Scalable Model Merging with Progressive Layer-wise Distillation
Jing Xu, Jiazheng Li, Jingzhao Zhang
TL;DR
This work addresses the challenge of merging multiple fine-tuned models without extensive retraining. It proves that data-agnostic merging can perform arbitrarily badly in the worst case, underscoring the need for domain-specific data. The authors introduce ProDistill, a progressive layer-wise distillation method that uses task vectors and activation matching to merge models one layer at a time with low memory overhead, enabling scaling to models beyond 10B parameters. Empirically, ProDistill achieves state-of-the-art performance among few-shot merging methods across vision, NLU, and LLM tasks, while demonstrating strong data, computation, and memory efficiency. The approach offers a practical path to integrating diverse capabilities in large, deployed models with reduced training costs and energy consumption.
Abstract
Model merging offers an effective way to integrate the capabilities of multiple fine-tuned models. However, the performance degradation of the merged model remains a challenge, particularly when little or no data is available. This paper first highlights the necessity of domain-specific data for model merging by proving that data-agnostic algorithms can have arbitrarily bad worst-case performance. Building on this theoretical insight, we explore the relationship between model merging and distillation, introducing a novel few-shot merging algorithm, ProDistill (Progressive Layer-wise Distillation). Contrary to the common belief that layer-wise training hurts performance, we show that layer-wise teacher-student distillation not only enhances scalability but also improves model merging performance. We conduct extensive experiments to show that, compared to existing few-shot merging methods, ProDistill achieves state-of-the-art performance, with improvements of up to 6.14% and 6.61% on vision and NLU tasks, respectively. Furthermore, we extend the experiments to models with over 10B parameters, showcasing the exceptional scalability of ProDistill.
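To make the idea of progressive layer-wise distillation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It rests on several assumptions that the abstract does not specify: each layer is a toy `tanh(x @ W)` map, the merged layer is the base weight plus a weighted sum of task vectors (fine-tuned weight minus base weight), there is one scalar coefficient per task and layer, and the student layer receives the merged model's previous activations while each teacher layer receives its own. All function names (`layer`, `fit_layer`, `prodistill_sketch`) and hyperparameters are hypothetical.

```python
import numpy as np

def layer(x, w):
    """Toy layer (assumption): linear map followed by tanh."""
    return np.tanh(x @ w)

def loss_and_grad(lam, w0, taus, student_in, teacher_out):
    """Activation-matching loss, summed over tasks, and its gradient w.r.t. lam."""
    w = w0 + sum(c * tau for c, tau in zip(lam, taus))
    loss, grad = 0.0, np.zeros_like(lam)
    for x, target in zip(student_in, teacher_out):
        y = layer(x, w)
        resid = y - target
        loss += np.mean(np.sum(resid ** 2, axis=1))
        # Gradient of the loss w.r.t. the pre-activation x @ w.
        back = 2.0 * resid * (1.0 - y ** 2) / x.shape[0]
        for j, tau in enumerate(taus):
            grad[j] += np.sum(back * (x @ tau))
    return loss, grad

def fit_layer(w0, taus, student_in, teacher_out, lam0=0.3, lr=0.1, steps=200):
    """Gradient descent on the coefficients; a step is kept only if the loss drops."""
    lam = np.full(len(taus), lam0)
    loss, grad = loss_and_grad(lam, w0, taus, student_in, teacher_out)
    init_loss = loss
    for _ in range(steps):
        cand = lam - lr * grad
        cand_loss, cand_grad = loss_and_grad(cand, w0, taus, student_in, teacher_out)
        if cand_loss < loss:
            lam, loss, grad = cand, cand_loss, cand_grad
        else:
            lr *= 0.5  # shrink the step size after a rejected step
    return lam, init_loss, loss

def prodistill_sketch(base, teachers, data):
    """Merge layer by layer; only one layer's coefficients are trained at a time."""
    student_in = [x.copy() for x in data]   # inputs seen by the merged (student) layer
    teacher_in = [x.copy() for x in data]   # inputs seen by each teacher layer
    merged, history = [], []
    for l, w0 in enumerate(base):
        taus = [t[l] - w0 for t in teachers]  # task vectors for this layer
        teacher_out = [layer(x, t[l]) for x, t in zip(teacher_in, teachers)]
        lam, before, after = fit_layer(w0, taus, student_in, teacher_out)
        w = w0 + sum(c * tau for c, tau in zip(lam, taus))
        merged.append(w)
        history.append((before, after))
        # Progress to the next layer: propagate both activation streams.
        student_in = [layer(x, w) for x in student_in]
        teacher_in = teacher_out
    return merged, history

# Synthetic demo: two "fine-tuned" teachers perturbed from a shared base model.
rng = np.random.default_rng(0)
d, n, num_layers, num_tasks = 8, 32, 2, 2
base = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(num_layers)]
teachers = [[w + 0.3 * rng.normal(size=(d, d)) / np.sqrt(d) for w in base]
            for _ in range(num_tasks)]
data = [rng.normal(size=(n, d)) for _ in range(num_tasks)]
merged, history = prodistill_sketch(base, teachers, data)
```

In this sketch, `history` holds the per-layer activation-matching loss before and after fitting. Because each layer is fitted in isolation, only that layer's weights, task vectors, and cached activations need to be in memory at once, which illustrates why a layer-wise scheme can scale to large models.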
