Realistic Evaluation of Model Merging for Compositional Generalization

Derek Tam, Yash Kant, Brian Lester, Igor Gilitschenski, Colin Raffel

TL;DR

This paper evaluates how well merging pretrained models achieves compositional generalization across tasks and modalities. It introduces a rigorous, shared evaluation framework that benchmarks eight merging methods on cross-domain image classification, cross-domain image generation, and cross-lingual NLP, examining held-in versus generalization performance, prerequisites, compute costs, and scaling with the number of merged models. Performance trends are domain-dependent: held-in performance is correlated with generalization in vision but anticorrelated with it in NLP. Increasing the number of merged models tends to hurt held-in accuracy but improve generalization, and some methods, such as TIES, offer favorable trade-offs. The work provides actionable guidance and releases code to standardize future evaluations of model merging for compositional generalization.

Abstract

Merging has become a widespread way to cheaply combine individual models into a single model that inherits their capabilities and attains better performance. This popularity has spurred rapid development of many new merging methods, which are typically validated in disparate experimental settings and frequently differ in the assumptions made about model architecture, data availability, and computational budget. In this work, we characterize the relative merits of different merging methods by evaluating them in a shared experimental setting and precisely identifying the practical requirements of each method. Specifically, our setting focuses on using merging for compositional generalization of capabilities in image classification, image generation, and natural language processing. Additionally, we measure the computational costs of different merging methods as well as how they perform when scaling the number of models being merged. Taken together, our results clarify the state of the field of model merging and provide a comprehensive and rigorous experimental setup to test new methods.
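
As a point of reference for what "merging" means here, the sketch below shows element-wise parameter averaging, one of the simplest ways to combine models that share an architecture. It is an illustration only and is not taken from the paper: the `average_merge` function, the flat `Params` dictionary format, and the toy parameter names are all hypothetical, and the methods the paper benchmarks may differ substantially.

```python
from typing import Dict, List

# Hypothetical flat format: parameter name -> list of float values.
Params = Dict[str, List[float]]


def average_merge(models: List[Params]) -> Params:
    """Merge models by element-wise averaging of matching parameters."""
    if not models:
        raise ValueError("need at least one model to merge")
    keys = models[0].keys()
    for m in models:
        if m.keys() != keys:
            raise ValueError("all models must share the same parameter names")
    merged: Params = {}
    for name in keys:
        # Group the i-th value of this parameter across all models.
        columns = zip(*(m[name] for m in models))
        merged[name] = [sum(col) / len(models) for col in columns]
    return merged


# Toy example with two "models" (values chosen for illustration).
model_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged = average_merge([model_a, model_b])
```

Averaging needs no data and almost no compute, which makes it a useful baseline when comparing the prerequisites and costs of other merging methods.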

Paper Structure

This paper contains 30 sections, 8 figures, and 22 tables.

Figures (8)

  • Figure 1: Tasks that our image classification and generation models are trained on. Each row denotes objects within a certain category (e.g., fruit, bird, and tool) and each column denotes a different domain (e.g., sketch, real, and clipart). Each (category, domain) pair forms a different task. For example, the "fruit sketch" task involves generating or classifying sketches of fruits (e.g., apples and bananas). Each constituent model is trained on one of the held-in tasks along the diagonal (solid border). Compositional generalization is measured via the performance on the generalization tasks off the diagonal (dashed border).
  • Figure 2: Performance of different merging methods in the image classification, image generation, and natural language processing settings described in \ref{sec:setup}. For each method, we plot the performance on the held-in datasets against the performance on unseen datasets that require compositional generalization. Additionally, we report the performance of the pretrained model, a multitask model trained on all held-in datasets at once, and the performance attained by training on a single task's data alone. Numerical values are provided in \ref{app:numerical-values}.
  • Figure 3: The computational cost vs. performance for each merging method. For the computational cost, we report an upper bound on the number of FLOPs required to merge a single layer (see \ref{app:hyperparameter-details} for details).
  • Figure 4: Hyperparameter sensitivity of each merging method. We plot the performance of each merging method as we sweep its hyperparameters. We index possible hyperparameter values from $0$ to $10$ because the specific hyperparameters and their ranges differ between merging methods. This captures the robustness of merging methods to different hyperparameters, regardless of the specific values. See \ref{app:hyperparameter-details} for a description of the hyperparameters.
  • Figure 5: Performance of merging methods as the number of constituent tasks increases. For each number of tasks along the x-axis, we sample a subset of tasks $10$ times and report the mean held-in and generalization performance. We additionally evaluate a pretrained model and a multitask model trained on all the held-in tasks on the sampled subsets. Since the generalization datasets and the pretrained model are fixed, the pretrained model's generalization performance is shown as a horizontal line.
  • ...and 3 more figures
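
The task layout described in the Figure 1 caption can be sketched as a small grid split. This is a minimal illustration, not the paper's code: the `split_tasks` helper is hypothetical, the category and domain names are only the examples given in the caption, and which category is paired with which domain on the diagonal is an assumption made here for concreteness.

```python
from itertools import product

# Example rows and columns taken from the Figure 1 caption.
categories = ["fruit", "bird", "tool"]
domains = ["sketch", "real", "clipart"]


def split_tasks(categories, domains):
    """Split (category, domain) pairs into held-in and generalization tasks.

    Held-in tasks lie on the diagonal: the i-th category is paired with the
    i-th domain (this specific pairing is an assumption). All remaining
    off-diagonal pairs are generalization tasks.
    """
    held_in = list(zip(categories, domains))
    held_in_set = set(held_in)
    generalization = [
        pair for pair in product(categories, domains) if pair not in held_in_set
    ]
    return held_in, generalization


held_in, generalization = split_tasks(categories, domains)
```

With three categories and three domains, this yields three held-in tasks (one per constituent model) and six generalization tasks, each of which combines a category and a domain that were seen only separately during training.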