A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning

Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Lucas Caccia, Haokun Liu, Tianlong Chen, Mohit Bansal, Leshem Choshen, Alessandro Sordoni

Abstract

The availability of performant pre-trained models has led to a proliferation of fine-tuned expert models that are specialized to a particular domain or task. Model MoErging methods aim to recycle expert models to create an aggregate system with improved performance or generalization. A key component of MoErging methods is the creation of a router that decides which expert model(s) to use for a particular input or application. The promise, effectiveness, and large design space of MoErging have spurred the development of many new methods over the past few years. This rapid pace of development has made it challenging to compare different MoErging methods, which are rarely compared to one another and are often validated in different experimental setups. To remedy such gaps, we present a comprehensive survey of MoErging methods that includes a novel taxonomy for cataloging key design choices and clarifying suitable applications for each method. Apart from surveying MoErging research, we inventory software tools and applications that make use of MoErging. We additionally discuss related fields of study such as model merging, multitask learning, and mixture-of-experts models. Taken as a whole, our survey provides a unified overview of existing MoErging methods and creates a solid foundation for future work in this burgeoning field.

Paper Structure (57 sections, 5 figures)

Figures (5)

  • Figure 1: Different levels of granularity for routing decisions in MoErging methods. Left: Task Level Routing selects a single expert for all examples belonging to a specific task. Middle: Example Level Routing chooses an expert independently for each input example. Right: Step/Token Level Routing makes a routing decision (i.e., selects an expert) at each processing step or for each generated token. The purple elements indicate the input used by the router to make its decisions.
  • Figure 2: Different levels of granularity for routing depth in MoErging methods. Left: Model Level Routing applies a single routing decision to select experts that are then used across all applicable modules or layers of the model. Right: Module Level Routing makes independent routing decisions at each layer or module where experts are integrated, allowing for different experts to be active at different depths. The turquoise boxes represent the routers operating at either the model level or the individual module level.
  • Figure 3: Different strategies for expert selection in MoErging methods. Left: Dense Selection utilizes the output of all available experts, often through a weighted combination. Right: Sparse Selection activates only a subset of the experts (e.g., the top-k most relevant ones) based on the router's decision. The router's output distribution indicates the selection strategy, with "All" implying dense selection and "top-1" implying sparse selection of the single most relevant expert.
  • Figure 4: Different methods for expert aggregation in MoErging methods. Left: Output Aggregation combines the outputs of multiple selected experts, often using weights determined by the router. Right: Parameter Aggregation merges the parameters of multiple selected experts into a single, aggregated expert model before processing the input.
  • Figure 5: Taxonomy of model MoErging design choices. References in the leaf nodes link to sections for specific papers that make some particular design choice. We omit references to methods for which a given choice is not applicable.
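
The design choices in the captions for Figures 3 and 4 can be made concrete with a small sketch. The code below is an illustration only, not the implementation of any surveyed method. It assumes each expert is a plain linear map stored as a list of rows, and that the router has already produced one score per expert. All function names (route, output_aggregation, parameter_aggregation) are hypothetical and chosen for this example.

```python
import math
from typing import List, Optional, Sequence

Vector = List[float]
Matrix = List[List[float]]  # one row per output unit (assumed linear expert)


def softmax(scores: Sequence[float]) -> Vector:
    """Turn raw router scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(scores: Sequence[float], k: Optional[int] = None) -> Vector:
    """Return one weight per expert.

    k=None -> dense selection: every expert keeps a non-zero weight.
    k=int  -> sparse selection: only the top-k experts keep weight,
              renormalized to sum to 1; the rest are set to 0.
    """
    probs = softmax(scores)
    if k is None:
        return probs
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return [probs[i] / kept if i in top else 0.0 for i in range(len(probs))]


def apply_expert(w: Matrix, x: Vector) -> Vector:
    """Apply a linear expert: y = W x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def output_aggregation(experts: List[Matrix], weights: Vector, x: Vector) -> Vector:
    """Run each selected expert on x, then take a weighted sum of the outputs."""
    out = [0.0] * len(experts[0])
    for w, g in zip(experts, weights):
        if g == 0.0:
            continue  # experts that were not selected are never executed
        y = apply_expert(w, x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out


def parameter_aggregation(experts: List[Matrix], weights: Vector, x: Vector) -> Vector:
    """Merge the expert parameters first, then run the single merged expert."""
    rows, cols = len(experts[0]), len(experts[0][0])
    merged = [
        [sum(g * w[r][c] for w, g in zip(experts, weights)) for c in range(cols)]
        for r in range(rows)
    ]
    return apply_expert(merged, x)


if __name__ == "__main__":
    experts = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[2.0, 0.0], [0.0, 2.0]],
        [[0.0, 1.0], [1.0, 0.0]],
    ]
    scores = [0.0, 2.0, 1.0]  # hypothetical router scores, one per expert
    x = [1.0, 3.0]
    print("dense weights:", route(scores))
    print("top-1 weights:", route(scores, k=1))
    print("output aggregation:", output_aggregation(experts, route(scores), x))
    print("parameter aggregation:", parameter_aggregation(experts, route(scores), x))
```

Because the experts in this sketch are linear, output aggregation and parameter aggregation return the same result up to floating-point error. With nonlinear experts the two generally differ, which is why the choice between them counts as a separate design decision.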