Trade-offs in Ensembling, Merging and Routing Among Parameter-Efficient Experts

Sanae Lotfi, Lucas Caccia, Alessandro Sordoni, Jordan T. Ash, Miroslav Dudik

TL;DR

Non-uniform ensembling and merging improve performance, but routing offers even greater gains. To mitigate the computational cost of routing, the authors analyze expert selection techniques and show that clustering and greedy subset selection can maintain reasonable performance with minimal overhead.

Abstract

While large language models (LLMs) fine-tuned with lightweight adapters achieve strong performance across diverse tasks, their performance on individual tasks depends on the fine-tuning strategy. Fusing independently trained models with different strengths has shown promise for multi-task learning through three main strategies: ensembling, which combines outputs from independent models; merging, which fuses model weights via parameter averaging; and routing, which integrates models in an input-dependent fashion. However, many design decisions in these approaches remain understudied, and the relative benefits of more sophisticated ensembling, merging and routing techniques are not fully understood. We empirically evaluate their trade-offs, addressing two key questions: What are the advantages of going beyond uniform ensembling or merging? And does the flexibility of routing justify its complexity? Our findings indicate that non-uniform ensembling and merging improve performance, but routing offers even greater gains. To mitigate the computational cost of routing, we analyze expert selection techniques, showing that clustering and greedy subset selection can maintain reasonable performance with minimal overhead. These insights advance our understanding of model fusion for multi-task learning.
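The three fusion strategies named in the abstract can be sketched on a toy problem. The snippet below is a minimal illustration, not the authors' implementation: it assumes a single frozen linear layer `W0` with LoRA-style experts `(A, B)`, and a hypothetical prototype-based router whose weights are a softmax over dot products with the input. All names and sizes are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def expert_logits(x, W0, A, B):
    """Logits of a frozen linear layer W0 plus one low-rank update B @ A."""
    return x @ (W0 + B @ A).T

def ensemble(x, W0, experts, coeffs):
    """Ensembling: run every expert, then combine their output probabilities."""
    probs = [softmax(expert_logits(x, W0, A, B)) for A, B in experts]
    return sum(c * p for c, p in zip(coeffs, probs))

def merge(x, W0, experts, coeffs):
    """Merging: average the expert parameters once, then run a single model."""
    A = sum(c * A_k for c, (A_k, _) in zip(coeffs, experts))
    B = sum(c * B_k for c, (_, B_k) in zip(coeffs, experts))
    return softmax(expert_logits(x, W0, A, B))

def route(x, W0, experts, prototypes):
    """Routing: like merging, but the coefficients are recomputed per input."""
    out = []
    for x_i in x:
        w = softmax(prototypes @ x_i)  # one weight per expert for this input
        out.append(merge(x_i[None, :], W0, experts, w)[0])
    return np.stack(out)

# Hypothetical toy sizes: input dim, output classes, LoRA rank, experts, inputs.
rng = np.random.default_rng(0)
d, v, r, K, n = 8, 5, 2, 3, 4
W0 = rng.normal(size=(v, d))
experts = [(rng.normal(size=(r, d)), rng.normal(size=(v, r))) for _ in range(K)]
x = rng.normal(size=(n, d))
uniform = np.full(K, 1.0 / K)

p_ens = ensemble(x, W0, experts, uniform)
p_mrg = merge(x, W0, experts, uniform)
p_rte = route(x, W0, experts, prototypes=rng.normal(size=(K, d)))
```

In this sketch, ensembling costs one forward pass per expert, merging costs a single pass, and routing recomputes the merged weights for every input, which mirrors the cost trade-off the abstract describes. With all-zero prototypes the router assigns equal weight to every expert, so `route` reduces to uniform merging.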

Paper Structure (30 sections, 8 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Model fusion of parameter-efficient experts for task-agnostic multi-task learning. (a): We use a publicly available library of LoRA experts, each fine-tuned independently on Flan v2 tasks from the same Phi-2 pretrained LLM (Ostapenko et al., 2024). These experts provide a diverse foundation for multi-task learning. (b): Comparison of three model fusion approaches and their trade-offs. Ensembling aggregates outputs from the independent experts at inference. Merging fuses expert weights into a single model via parameter averaging. Routing extends merging by making the weight combination input-dependent, adapting to each input dynamically.
  • Figure 2: Performance of different model fusion approaches for multi-task learning. We evaluate the average multi-task test loss across $256$ Flan v2 tasks for various ensembling, merging, and routing strategies, reporting the standard error across all tasks. Ensembling methods include uniform output averaging, as well as learned ensembling coefficients through stochastic gradient descent (SGD) optimization and knowledge distillation into a single model. Merging strategies involve parameter-space fusion via uniform averaging and SGD optimization of the fusion coefficients, where the learned coefficients can be either globally shared across all layers (Global SGD) or layer-specific. Routing strategies include layer-dependent routing optimization via SGD, hierarchical clustering (HC), and an optimized version of HC initialized with Arrow weights (Arrow HC). Our results indicate that ensembling outperforms merging, which may reflect limitations of the mode connectivity assumption. Routing, however, delivers the best performance among non-oracle methods. Notably, end-to-end SGD optimization consistently yields the best results across all model fusion approaches.
  • Figure 3: Mode connectivity analysis in the multi-task setting. For each subplot, we interpolate between two experts $(A_1, B_1)$ and $(A_2, B_2)$, independently fine-tuned on separate tasks $T_1$ and $T_2$ from the Flan v2 dataset (Longpre et al., 2023). We evaluate the performance of the interpolated model $(A_\alpha = (1 - \alpha) A_1 + \alpha A_2, B_\alpha = (1 - \alpha) B_1 + \alpha B_2)$ on the combined dataset $T_1+T_2$, which contains both tasks, with $\alpha \in [0,1]$ shown on the x-axis. Let $\mathcal{M}_{O_1}$ and $\mathcal{M}_{O_2}$ represent the oracle experts that are best for tasks $T_1$ and $T_2$, respectively. The solid pink line represents the average performance of $\mathcal{M}_{O_1}$ and $\mathcal{M}_{O_2}$ on the combined dataset, where we use the best expert for each input. Our results demonstrate that careful expert selection outperforms linearly merging experts on multiple pairs of tasks.
  • Figure 4: Performance comparison between private and MBC experts across different model fusion approaches.
  • Figure 5: Comparison of multi-task test loss across different ensembling and merging strategies. Ensembling probabilities outperforms ensembling logits, confirming that averaging probabilities leads to better calibration. Similarly, full-rank merging slightly outperforms low-rank merging. Error bars indicate standard error over tasks.
  • ...and 2 more figures
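
The interpolation in the Figure 3 caption and the low-rank versus full-rank comparison in Figure 5 can be related with a short numerical check. This is a sketch under an assumption: "low-rank merging" is taken to mean averaging the factors $A$ and $B$ separately (as in Figure 3), and "full-rank merging" is taken to mean averaging the products $B_k A_k$; the captions do not define either term. Under that reading, the two updates differ by $B_\alpha A_\alpha - [(1-\alpha) B_1 A_1 + \alpha B_2 A_2] = -\alpha(1-\alpha)(B_1 - B_2)(A_1 - A_2)$, which the code verifies on random matrices of hypothetical size.

```python
import numpy as np

def low_rank_merge(A1, B1, A2, B2, alpha):
    """Interpolate the LoRA factors separately, as in the Figure 3 caption."""
    A_alpha = (1 - alpha) * A1 + alpha * A2
    B_alpha = (1 - alpha) * B1 + alpha * B2
    return B_alpha @ A_alpha

def full_rank_merge(A1, B1, A2, B2, alpha):
    """Interpolate the full updates B_k @ A_k (assumed meaning of full-rank merging)."""
    return (1 - alpha) * (B1 @ A1) + alpha * (B2 @ A2)

# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_in, d_out, rank = 6, 4, 2
A1, A2 = rng.normal(size=(rank, d_in)), rng.normal(size=(rank, d_in))
B1, B2 = rng.normal(size=(d_out, rank)), rng.normal(size=(d_out, rank))

# Cross term that separates the two merging schemes.
cross = (B1 - B2) @ (A1 - A2)

for alpha in np.linspace(0.0, 1.0, 5):
    gap = low_rank_merge(A1, B1, A2, B2, alpha) - full_rank_merge(A1, B1, A2, B2, alpha)
    # The gap equals -alpha * (1 - alpha) * cross, so it vanishes at alpha = 0 and 1.
    assert np.allclose(gap, -alpha * (1 - alpha) * cross)
    print(f"alpha={alpha:.2f}  ||gap||={np.linalg.norm(gap):.4f}")
```

The two schemes agree at the endpoints and differ most at $\alpha = 0.5$, and the gap grows with how far apart the two experts' factors are. This is only an algebraic observation about the parameterization; it does not by itself explain the loss differences reported in the figures.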