A Mixture-of-Experts Approach to Few-Shot Task Transfer in Open-Ended Text Worlds

Christopher Z. Cui, Xiangyu Peng, Mark O. Riedl

TL;DR

This work tackles task transfer in open-ended, text-based worlds by introducing a Mixture-of-Experts (MoE) framework that combines multiple frozen, role-aligned KG-A2C experts with a single trainable, open-ended expert. An attention mechanism blends action logits across experts, with value seeding guiding attention toward promising experts, and a dedicated hot expert fills policy gaps for novel tasks. The approach is evaluated in the LIGHT environment with role-based blends and partial blends, showing improved zero-shot rewards and faster few-shot learning compared to baselines, while demonstrating robustness to distractor experts. The findings suggest that maintaining frozen, specialized policies and learning to attend to the most relevant ones, plus a minimal trainable component, can significantly accelerate adaptation in complex, open-ended RL settings. This has implications for building flexible, scalable agents capable of rapidly acquiring new behaviors with limited interaction data.

Abstract

Open-ended worlds are those in which there are no pre-specified goals or environmental reward signals. As a consequence, an agent must know how to perform a multitude of tasks. However, when a new task is presented to an agent, we expect it to be able to reuse some of what it knows from previous tasks to rapidly learn that new task. We introduce a novel technique whereby policies for different a priori known tasks are combined into a Mixture-of-Experts model with an attention mechanism across a mix of frozen and unfrozen experts. The model learns when to attend to frozen task-specific experts and learns new experts to handle novel situations. We work in an open-ended text-based environment in which the agent is tasked with behaving like different types of character roles and must rapidly learn behaviors associated with new character role types. We show that our agent both obtains more rewards in the zero-shot setting and discovers these rewards with greater sample efficiency in the few-shot learning setting.


Paper Structure

This paper contains 40 sections, 5 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Pipeline of our MoE Agent. At each time-step, each expert produces a logit distribution over actions, $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_N$. Each is passed through a softmax to obtain probabilities, which are multiplied by the values $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_N$ produced by each expert's critic module. The scaled probabilities are then mixed by the attention module and averaged (operation represented by $\mu$) with the $\mathbf{a}_N$ produced by the trainable expert to obtain $\mathbf{l}_t$. These averaged logits are then passed through a softmax and sampled to produce an action.
  • Figure 2: The full agent versus unfrozen experts allowed to continue training and a new KG-A2C. The left graph shows test-time performance on a persona composed only of pre-existing expert behaviors, while the right shows performance on a persona that mixes pre-existing and new behaviors.
  • Figure 3: The test-time performance of the MoE agent with only the original four experts versus the same agent with four added random experts. When all experts contribute some relevant information to the new task, the MoE agent's performance suffers slightly because the attention module needs more time to distinguish relevant experts from irrelevant ones. When there are only a few relevant experts, additional irrelevant experts have little to no impact on the MoE agent's performance.
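
The pipeline described in the Figure 1 caption can be sketched as a single forward step. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `moe_step`, the use of externally supplied `attn_scores` in place of a learned attention module, the treatment of each critic value as a scalar per expert, and the equal-weight average used for $\mu$ are all hypothetical choices made to keep the example self-contained.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def moe_step(frozen_logits, frozen_values, attn_scores, trainable_logits, rng):
    """One action-selection step, loosely following the Figure 1 caption.

    frozen_logits:    (K, A) action logits from K frozen experts
    frozen_values:    (K,)   critic value per frozen expert (assumed scalar)
    attn_scores:      (K,)   unnormalized attention scores (hypothetical input;
                             a real agent would compute these from the state)
    trainable_logits: (A,)   action logits from the trainable expert
    """
    probs = softmax(frozen_logits, axis=-1)       # per-expert action probabilities
    scaled = probs * frozen_values[:, None]       # scale by each expert's value
    weights = softmax(attn_scores)                # attention over frozen experts
    mixed = weights @ scaled                      # (A,) attention-weighted mixture
    l_t = 0.5 * (mixed + trainable_logits)        # mu: assumed equal-weight average
    policy = softmax(l_t)                         # final action distribution
    action = int(rng.choice(len(policy), p=policy))
    return action, policy, weights

# Usage with random stand-in experts (4 frozen experts, 6 actions).
rng = np.random.default_rng(0)
action, policy, weights = moe_step(
    frozen_logits=rng.normal(size=(4, 6)),
    frozen_values=np.array([1.0, 0.2, 0.5, 0.1]),
    attn_scores=np.zeros(4),
    trainable_logits=np.zeros(6),
    rng=rng,
)
```

In this sketch, an expert whose critic reports a high value contributes more to the mixed vector before attention is applied, which is one plausible reading of how the value scaling in the caption steers the blend toward promising experts.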