Twin-Merging: Dynamic Integration of Modular Expertise in Model Merging

Zhenyi Lu, Chenghao Fan, Wei Wei, Xiaoye Qu, Dangyang Chen, Yu Cheng

TL;DR

Twin-Merging tackles interference and data heterogeneity in model merging by explicitly separating shared and exclusive task knowledge, compressing the exclusive parts with SVD, and dynamically merging them via a small input-conditioned router. This combination of modularization and dynamic fusion significantly narrows the gap to fine-tuning across NLP and vision benchmarks, achieving large gains on discriminative tasks and even surpassing the fine-tuned upper bound on generative tasks, while dramatically reducing parameter storage. The approach scales to large models (e.g., 72B) and to many tasks, is robust to unseen data, and is compatible with existing merging methods. Overall, Twin-Merging offers a practical, scalable, and storage-efficient path to multi-task models without retraining experts, with strong potential for deployment in diverse environments where the data distribution shifts.

Abstract

In the era of large language models, model merging is a promising way to combine multiple task-specific models into a single multitask model without extra training. However, two challenges remain: (a) interference between different models and (b) heterogeneous data during testing. Traditional model merging methods often show significant performance gaps compared to fine-tuned models due to these issues. Additionally, a one-size-fits-all model lacks flexibility for diverse test data, leading to performance degradation. We show that both shared and exclusive task-specific knowledge are crucial for merging performance, but directly merging exclusive knowledge hinders overall performance. In view of this, we propose Twin-Merging, a method that encompasses two principal stages: (1) modularizing knowledge into shared and exclusive components, with compression to reduce redundancy and enhance efficiency; (2) dynamically merging shared and task-specific knowledge based on the input. This approach narrows the performance gap between merged and fine-tuned models and improves adaptability to heterogeneous data. Extensive experiments on 20 datasets for both language and vision tasks demonstrate the effectiveness of our method, showing an average improvement of 28.34% in absolute normalized score for discriminative tasks and even surpassing the fine-tuned upper bound on the generative tasks. Our implementation is available at https://github.com/LZY-the-boys/Twin-Merging
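
The two stages described in the abstract can be illustrated with a small NumPy sketch on a single weight matrix. This is only a toy illustration under stated assumptions, not the authors' implementation: the abstract does not say how the shared component is obtained, so the sketch simply averages the experts; the router is stood in for by a softmax over hypothetical per-input logits; and the function names (`svd_compress`, `modularize`, `dynamic_merge`, `softmax`) are invented for this example.

```python
import numpy as np

def svd_compress(delta, rank):
    """Keep the top-`rank` singular components of a 2-D weight difference."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Store two thin factors instead of the full matrix.
    return u[:, :rank] * s[:rank], vt[:rank]

def modularize(experts, rank):
    """Stage 1: split experts into a shared matrix and compressed exclusive parts."""
    # Assumption: a plain average of the experts stands in for shared knowledge.
    shared = np.mean(experts, axis=0)
    exclusive = [svd_compress(w - shared, rank) for w in experts]
    return shared, exclusive

def softmax(logits):
    """Numerically stable softmax, used here as a stand-in router output."""
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def dynamic_merge(shared, exclusive, router_weights):
    """Stage 2: add the exclusive parts back, weighted per input."""
    merged = shared.copy()
    for weight, (left, right) in zip(router_weights, exclusive):
        merged += weight * (left @ right)
    return merged

# Toy usage: three "experts" for one 6x4 layer, compressed to rank 2.
rng = np.random.default_rng(0)
experts = [rng.normal(size=(6, 4)) for _ in range(3)]
shared, exclusive = modularize(experts, rank=2)
# Hypothetical router logits for one input that mostly matches expert 0.
weights = softmax(np.array([8.0, 0.0, 0.0]))
merged = dynamic_merge(shared, exclusive, weights)
```

With full rank and a one-hot router weight, this sketch reproduces the selected expert exactly, since the shared part plus that expert's exclusive difference is the expert itself. Lowering `rank` trades reconstruction fidelity for storage, which is the role the abstract assigns to compression.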

Paper Structure (49 sections, 5 equations, 9 figures, 12 tables, 1 algorithm)

This paper contains 49 sections, 5 equations, 9 figures, 12 tables, 1 algorithm.

Figures (9)

  • Figure 1: Subfigure (I) shows that in conventional merging methods, parameters from different task-specific models and a pre-trained model are weighted-summed into a single multitask model for inference. Subfigure (II) illustrates that our Twin-Merging method first isolates shared knowledge, then extracts exclusive knowledge by identifying differences between task experts and the shared model. This exclusive knowledge is then compressed into sparse vectors. Subfigure (III) shows that during testing, Twin-Merging dynamically merges shared and compressed specialized knowledge based on test inputs to form the final inference model.
  • Figure 1: Merging without parameter interference and merging between similar tasks both cause performance degradation (note that these two experiments use different datasets).
  • Figure 2: The effectiveness of Twin-Merging in terms of performance and parameter-efficiency.
  • Figure 3: The impact of different ratios of shared knowledge and exclusive knowledge.
  • Figure 4: Averaged normalized accuracy vs. the number of tasks for various benchmarks. Twin-Merging maintains performance regardless of task number and compresses the fine-tuned checkpoints.
  • ...and 4 more figures