Collective Model Intelligence Requires Compatible Specialization

Jyothish Pari, Samy Jelassi, Pulkit Agrawal

TL;DR

This work investigates routing-based merging strategies, which offer more flexible methods for combining specialized models by dynamically routing across different layers, and proposes a new direction for achieving collective model intelligence through what is called compatible specialization.

Abstract

In this work, we explore the limitations of combining models by averaging intermediate features, referred to as model merging, and propose a new direction for achieving collective model intelligence through what we call compatible specialization. Current methods for model merging, such as parameter and feature averaging, struggle to effectively combine specialized models due to representational divergence during fine-tuning. As models specialize to their individual domains, their internal feature representations become increasingly incompatible, leading to poor performance when attempting to merge them for new tasks. We analyze this phenomenon using centered kernel alignment (CKA) and show that as models specialize, the similarity in their feature space structure diminishes, hindering their capacity for collective use. To address these challenges, we investigate routing-based merging strategies, which offer more flexible methods for combining specialized models by dynamically routing across different layers. This allows us to improve on existing methods by combining features from multiple layers rather than relying on fixed, layer-wise combinations. However, we find that these approaches still face limitations when layers within models are representationally incompatible. Our findings highlight the importance of designing new approaches for model merging that operate on well-defined input and output spaces, similar to how humans communicate through language rather than intermediate neural activations.
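The CKA comparison mentioned in the abstract can be illustrated with a small, dependency-free sketch. The abstract does not say which CKA variant is used, so the linear form is an assumption here, and the function names (`center_columns`, `cross_gram_sq_norm`, `linear_cka`) are hypothetical. Each input is a list of rows, one row per example and one column per feature, and both models must be evaluated on the same examples.

```python
import math

def center_columns(X):
    """Subtract each feature's mean so every column sums to zero."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def cross_gram_sq_norm(A, B):
    """Squared Frobenius norm of A^T B, computed column pair by column pair."""
    total = 0.0
    for col_a in zip(*A):
        for col_b in zip(*B):
            dot = sum(a * b for a, b in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(X, Y):
    """Linear CKA between two activation matrices over the same examples.

    Assumes neither matrix is constant across examples; otherwise the
    denominator is zero.
    """
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_gram_sq_norm(Xc, Yc)
    denominator = math.sqrt(
        cross_gram_sq_norm(Xc, Xc) * cross_gram_sq_norm(Yc, Yc)
    )
    return numerator / denominator

# Toy activations: four examples, two features (illustrative values only).
acts_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
# The same activations rotated by 90 degrees.
acts_b = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0], [0.0, 2.0]]
# An unrelated one-dimensional representation of the same four examples.
acts_c = [[1.0], [-1.0], [2.0], [0.0]]

cka_rotated = linear_cka(acts_a, acts_b)
cka_unrelated = linear_cka(acts_a, acts_c)
```

Linear CKA is invariant to rotations and uniform rescaling of the feature space, so `cka_rotated` stays at 1 up to floating-point error even though the individual coordinates differ. A score that falls as fine-tuning proceeds therefore indicates a change in the structure of the representation, not merely a change of basis.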

Paper Structure

This paper contains 20 sections, 5 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: In the left sub-figure, we illustrate how fine-tuning a base model separately on two different tasks initially improves merging performance, but after a critical point, further fine-tuning leads to a decrease in performance. In the right sub-figure, we examine how merging features across different layers in the two models affects merging performance. The diagonal teal color represents cases where layers of the same index/position are merged. As we attempt to merge layers that are progressively further apart in the network, performance begins to plateau. We hypothesize that both phenomena stem from a lack of compatible specialization, where models need to maintain compatibility to be effective in collective use.
  • Figure 2: The general routing-based merging pipeline. First, the MLP layers of a base model are finetuned on specialized datasets. Once we obtain a set of specialized models, we construct a Mixture of Experts (MoE) whose experts are the finetuned MLP layers from the various models. Finally, we train only the router on a novel adaptation dataset.
  • Figure 3: (Left) Validation cross-entropy loss (CE Loss) for the math and coding models during finetuning, as well as for the merged model. The math and coding models exhibit steady decreases in validation loss as they specialize on their respective tasks. In contrast, the validation loss of the model merged via activation interpolation, evaluated on a cross-domain task requiring both math and coding, decreases quickly and then increases gradually after a critical point. (Middle) Merging loss plotted against CKA similarity computed on data from the adaptation dataset. (Right) Merging loss plotted against CKA similarity computed on data from the pretraining dataset.
  • Figure 4: Performance comparison of various model merging techniques for In-Domain and Cross-Domain tasks. The plot shows the progression of merging methods from simple interpolation strategies (SLERP, LERP, activation interpolation; see the appendix on interpolation) to more complex ones involving routers (Single Router, Full Router, Routing with Base Model). The trend shows that increasing the complexity and capacity of the merging method yields performance gains, as reflected by lower adaptation loss.
  • Figure 5: Comparison of three routing strategies in model merging: Standard, 2-Layer, and 3-Layer Routing (see the router visualization figure). Evaluated on two in-domain tasks and one cross-domain task, the results show that increased routing complexity reduces CE loss across all tasks. 2-Layer Routing achieves notable gains over standard routing, and 3-Layer Routing offers further, minor improvements.
  • ...and 3 more figures
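
The merging operations named in the captions above (LERP, SLERP, activation interpolation, and router-based mixing of experts) can be sketched on plain vectors as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, vectors stand in for flattened parameters or layer activations, and the router is shown as a soft (softmax-weighted) mixture because the captions do not say whether routing is soft or top-k.

```python
import math

def lerp(a, b, t):
    """Linear interpolation between vectors a and b with weight t in [0, 1].

    Applied to parameters this is parameter averaging; applied to the
    outputs of two layers it is activation interpolation.
    """
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation; assumes a and b are non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Nearly parallel or anti-parallel vectors: fall back to LERP.
        return lerp(a, b, t)
    weight_a = math.sin((1.0 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def routed_output(expert_outputs, router_logits):
    """Soft routing: weight each expert's output by its router probability."""
    weights = softmax(router_logits)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]

# Toy vectors standing in for two specialized models' features.
math_features = [1.0, 0.0]
code_features = [0.0, 1.0]

midpoint_lerp = lerp(math_features, code_features, 0.5)
midpoint_slerp = slerp(math_features, code_features, 0.5)
mixed = routed_output([math_features, code_features], [0.0, 0.0])
```

With a fixed `t`, interpolation applies the same mixing weight to every input, whereas a router produces its logits per input, which is the extra flexibility the captions attribute to routing-based merging. In the pipeline of Figure 2, only the parameters that produce `router_logits` would be trained; the experts stay frozen.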