Learning to Route Among Specialized Experts for Zero-Shot Generalization

Mohammed Muqeeth, Haokun Liu, Yufan Liu, Colin Raffel

TL;DR

PHATGOOSE introduces a post-hoc, tokenwise gating mechanism to route among independently trained specialized PEFT modules, enabling zero-shot generalization without sharing datasets. By training lightweight gates after freezing both the base model and the expert modules, and performing per-layer, per-token top-k routing, PHATGOOSE combines diverse expert capabilities to improve unseen task performance. Across T5-family experiments with two expert pools, PHATGOOSE consistently outperforms prior post-hoc baselines and can rival explicit multitask training, demonstrating the value of decentralized, adaptive expert recycling. The work also provides qualitative insight into diverse routing strategies and sets the stage for expanding decentralized collaboration in generalist AI development.

Abstract

Recently, there has been a widespread proliferation of "expert" language models that are specialized to a specific task or domain through parameter-efficient fine-tuning. How can we recycle large collections of expert language models to improve zero-shot generalization to unseen tasks? In this work, we propose Post-Hoc Adaptive Tokenwise Gating Over an Ocean of Specialized Experts (PHATGOOSE), which learns to route among specialized modules that were produced through parameter-efficient fine-tuning. Unlike past methods that learn to route among specialized models, PHATGOOSE explores the possibility that zero-shot generalization will be improved if different experts can be adaptively chosen for each token and at each layer in the model. Crucially, our method is post-hoc: it does not require simultaneous access to the datasets used to create the specialized models and only requires a modest amount of additional compute after each expert model is trained. In experiments covering a range of specialized model collections and zero-shot generalization benchmarks, we find that PHATGOOSE outperforms past methods for post-hoc routing and, in some cases, outperforms explicit multitask training (which requires simultaneous data access). To better understand the routing strategy learned by PHATGOOSE, we perform qualitative experiments to validate that PHATGOOSE's performance stems from its ability to make adaptive per-token and per-module expert choices. We release all of our code to support future work on improving zero-shot generalization by recycling specialized experts.

Paper Structure (26 sections, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Average performance of different multitask training and expert routing methods when using the same held-in and held-out tasks as T0 (Sanh et al., 2021). Notably, our proposed method PHATGOOSE outperforms all past methods for recycling experts as well as explicit multitask training (which requires simultaneous data access) and nearly matches the performance of an oracle routing scheme. See the experiments section for more details. Exact numerical results for all methods can be found in the corresponding results table.
  • Figure 2: Visualization of how PHATGOOSE learns to route among specialized modules. This diagram shows how routing is learned at a layer where a module is inserted; typically a PEFT-based model introduces many such modules at various layers throughout the model. Left: After a specialized module (here, shown as a LoRA (Hu et al., 2021) module) has been trained, it is frozen and a sigmoid gate is trained to choose which activations should be fed into the module. Right: During inference, a routing distribution (shown as a bar plot) is computed from the dot product scores between the normalized gates and an activation. Top-$k$ routing is then performed by choosing the modules according to this routing distribution.
  • Figure 3: Routing distributions produced by PHATGOOSE for Story Cloze and CB (from T0HO). The Oracle router's chosen module is highlighted by dashed lines. On Story Cloze, PHATGOOSE chooses the Oracle module in the encoder and uses diverse experts in the decoder, yet still matches Oracle performance. On CB, PHATGOOSE almost never uses the Oracle module and produces significantly better performance by using a wide range of modules.
  • Figure 4: BIG-bench Hard (BBH) results of different methods in the T0 Held-In setting.
  • Figure 5: BIG-bench Lite (BBL) results of different methods in the T0 Held-In setting.
  • ...and 2 more figures
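
The routing procedure described in the Figure 2 caption can be illustrated with a small sketch. This is a toy illustration under stated assumptions, not the authors' released code: the function names (`sigmoid_gate`, `routing_distribution`, `top_k_route`) are hypothetical, "normalized gates" is assumed here to mean L2 normalization, the routing distribution is assumed to be a softmax over the dot-product scores, and the top-k weights are assumed to be renormalized to sum to one. The paper's exact normalization and weighting may differ.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v, eps=1e-12):
    """Scale a vector to (approximately) unit L2 norm.

    Assumption: the caption only says "normalized"; L2 is one plausible choice.
    """
    norm = math.sqrt(dot(v, v))
    return [x / (norm + eps) for x in v]


def sigmoid_gate(gate, activation):
    """Training-time gate for a single frozen module.

    Returns a scalar in (0, 1) computed from the gate vector and one token
    activation; in this sketch it would scale that module's output.
    """
    return 1.0 / (1.0 + math.exp(-dot(gate, activation)))


def routing_distribution(gates, activation):
    """Inference-time distribution over modules for one token at one layer.

    Scores are dot products between each normalized gate and the activation;
    a numerically stable softmax (an assumption) turns them into probabilities.
    """
    scores = [dot(l2_normalize(g), activation) for g in gates]
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gates, activation, k=2):
    """Pick the k most probable modules and renormalize their weights."""
    probs = routing_distribution(gates, activation)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]


# Toy example: three expert gates and one 2-dimensional token activation.
gates = [[1.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
activation = [0.9, 0.1]
routes = top_k_route(gates, activation, k=2)
```

Because the routing decision depends only on the activation of the current token at the current layer, different tokens and different layers can select different experts, which is the adaptive per-token, per-module behavior the abstract refers to.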