Realistic Evaluation of Model Merging for Compositional Generalization
Derek Tam, Yash Kant, Brian Lester, Igor Gilitschenski, Colin Raffel
TL;DR
This paper evaluates how well methods for merging pretrained models achieve compositional generalization across tasks and modalities. It introduces a shared evaluation framework that benchmarks eight merging methods on cross-domain image classification, cross-domain image generation, and cross-lingual NLP, comparing held-in and generalization performance, each method's prerequisites, computational cost, and behavior as the number of merged models grows. Performance trends depend on the domain: held-in performance correlates with generalization in vision but is anticorrelated with it in NLP. Merging more models tends to hurt held-in performance but improve generalization, and some methods, such as TIES, offer favorable trade-offs. The paper also provides practical guidance and releases code to standardize future evaluations of model merging for compositional generalization.
Abstract
Merging has become a widespread way to cheaply combine individual models into a single model that inherits their capabilities and attains better performance. This popularity has spurred rapid development of many new merging methods, which are typically validated in disparate experimental settings and frequently differ in the assumptions made about model architecture, data availability, and computational budget. In this work, we characterize the relative merits of different merging methods by evaluating them in a shared experimental setting and precisely identifying the practical requirements of each method. Specifically, our setting focuses on using merging for compositional generalization of capabilities in image classification, image generation, and natural language processing. Additionally, we measure the computational costs of different merging methods as well as how they perform when scaling the number of models being merged. Taken together, our results clarify the state of the field of model merging and provide a comprehensive and rigorous experimental setup to test new methods.
