A Mixture-of-Experts Approach to Few-Shot Task Transfer in Open-Ended Text Worlds
Christopher Z. Cui, Xiangyu Peng, Mark O. Riedl
TL;DR
This work tackles task transfer in open-ended, text-based worlds by introducing a Mixture-of-Experts (MoE) framework that combines multiple frozen, role-aligned KG-A2C experts with a single trainable, open-ended expert. An attention mechanism blends action logits across experts, with value seeding guiding attention toward promising experts, and a dedicated hot expert fills policy gaps for novel tasks. The approach is evaluated in the LIGHT environment with role-based blends and partial blends, showing improved zero-shot rewards and faster few-shot learning compared to baselines, while demonstrating robustness to distractor experts. The findings suggest that maintaining frozen, specialized policies and learning to attend to the most relevant ones, plus a minimal trainable component, can significantly accelerate adaptation in complex, open-ended RL settings. This has implications for building flexible, scalable agents capable of rapidly acquiring new behaviors with limited interaction data.
Abstract
Open-ended worlds are those in which there are no pre-specified goals or environmental reward signals. As a consequence, an agent must know how to perform a multitude of tasks. However, when a new task is presented to an agent, we expect it to reuse some of what it knows from previous tasks to rapidly learn that new task. We introduce a novel technique whereby policies for different a priori known tasks are combined into a Mixture-of-Experts model with an attention mechanism across a mix of frozen and unfrozen experts. The model learns to attend to frozen task-specific experts when appropriate and learns new experts to handle novel situations. We work in an open-ended text-based environment in which the agent is tasked with behaving like different types of character roles and must rapidly learn behaviors associated with new character role types. We show that our agent both obtains more rewards in the zero-shot setting and discovers these rewards with greater sample efficiency in the few-shot learning setting.
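To make the idea of attending across frozen and unfrozen experts concrete, the following is a minimal sketch of attention-weighted blending of per-expert action logits. It is an illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `blend_expert_logits`), the hand-picked attention scores, and the toy three-action space are all hypothetical, and the sketch omits the learned attention module and the KG-A2C experts themselves.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def blend_expert_logits(expert_logits, attention_scores):
    """Blend per-expert action logits using softmax attention weights.

    expert_logits: one list of action logits per expert (same length each).
    attention_scores: one unnormalized score per expert. These are
        hypothetical inputs; in a trained agent they would be produced by
        a learned attention module rather than supplied by hand.
    Returns (attention weights, blended action distribution).
    """
    if len(expert_logits) != len(attention_scores):
        raise ValueError("need exactly one attention score per expert")
    weights = softmax(attention_scores)
    n_actions = len(expert_logits[0])
    # Weighted sum of logits across experts, computed per action.
    blended = [
        sum(w * logits[a] for w, logits in zip(weights, expert_logits))
        for a in range(n_actions)
    ]
    return weights, softmax(blended)


# Toy setup (assumed values): two frozen experts and one trainable expert
# scoring three candidate actions.
frozen_a = [2.0, 0.0, -1.0]
frozen_b = [-1.0, 0.5, 2.0]
trainable = [0.0, 0.0, 0.0]  # an untrained expert with no preference yet

weights, policy = blend_expert_logits(
    [frozen_a, frozen_b, trainable],
    attention_scores=[3.0, 0.0, 0.0],  # attention favors the first expert
)
```

In this sketch, a high attention score on the first frozen expert lets its preferred action dominate the blended policy, while the trainable expert contributes little until its logits and attention score are learned.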
