Mixture-of-Experts Meets In-Context Reinforcement Learning

Wenhao Wu, Fuhong Liu, Haoru Li, Zican Hu, Daoyi Dong, Chunlin Chen, Zhi Wang

TL;DR

This paper tackles two core bottlenecks in in-context reinforcement learning (ICRL): the multi-modality of state-action-reward prompts and the broad, heterogeneous task distribution. It introduces T2MIR, a simple and scalable architectural enhancement that replaces the feedforward layer in transformer decision models with two parallel mixture-of-experts (MoE) layers: a token-wise MoE that processes modality-specific token semantics and a task-wise MoE that routes different tasks to specialized experts. A contrastive learning objective improves task routing by aligning router representations with task identity, while a balance regularization keeps expert usage even. Empirical results across diverse offline multi-task benchmarks show that T2MIR consistently improves in-context learning speed and final performance, and is robust to data quality and task distribution, establishing MoE as a promising direction for scalable ICRL. The work provides a practical path toward leveraging MoE gains in RL, with code available for reproducibility and as a basis for further extension to more complex vision-language-action settings.

Abstract

In-context reinforcement learning (ICRL) has emerged as a promising paradigm for adapting RL agents to downstream tasks through prompt conditioning. However, two notable challenges remain in fully harnessing in-context learning within RL domains: the intrinsic multi-modality of the state-action-reward data and the diverse, heterogeneous nature of decision tasks. To tackle these challenges, we propose T2MIR (Token- and Task-wise MoE for In-context RL), a framework that brings the architectural advances of mixture-of-experts (MoE) into transformer-based decision models. T2MIR substitutes the feedforward layer with two parallel layers: a token-wise MoE that captures distinct semantics of input tokens across multiple modalities, and a task-wise MoE that routes diverse tasks to specialized experts, managing a broad task distribution while alleviating gradient conflicts. To enhance task-wise routing, we introduce a contrastive learning method that maximizes the mutual information between the task and its router representation, enabling more precise capture of task-relevant information. The outputs of the two MoE components are concatenated and fed into the next layer. Comprehensive experiments show that T2MIR significantly improves in-context learning capacity and outperforms various types of baselines. We bring the potential and promise of MoE to ICRL, offering a simple and scalable architectural enhancement that moves ICRL one step closer to the achievements seen in the language and vision communities. Our code is available at https://github.com/NJU-RL/T2MIR.
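
The layer structure described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation (see the linked repository for that); it only shows the data flow of two parallel MoE branches whose outputs are concatenated. Several details are assumptions made for illustration: experts are single linear maps, the task-wise router acts on a mean-pooled sequence representation, each branch outputs half the model width so that concatenation restores it, and the names `MoE` and `parallel_moe_layer` as well as the `top_k` values are hypothetical.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class MoE:
    """Top-k mixture of linear experts with a linear router (illustrative only)."""

    def __init__(self, d_in, d_out, n_experts, top_k, rng):
        self.router = rand_matrix(n_experts, d_in, rng)
        self.experts = [rand_matrix(d_out, d_in, rng) for _ in range(n_experts)]
        self.top_k = top_k
        self.d_out = d_out

    def gate(self, h):
        """Return [(expert_index, weight)] for the top-k experts; weights sum to 1."""
        probs = softmax(matvec(self.router, h))
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[: self.top_k]
        norm = sum(probs[i] for i in top)
        return [(i, probs[i] / norm) for i in top]

    def apply(self, x, gates):
        """Weighted sum of the selected experts' outputs for input x."""
        out = [0.0] * self.d_out
        for i, w in gates:
            y = matvec(self.experts[i], x)
            out = [o + w * yi for o, yi in zip(out, y)]
        return out


def parallel_moe_layer(tokens, token_moe, task_moe):
    """Replace the feedforward layer with two parallel MoE branches."""
    d = len(tokens[0])
    # Assumption: the task-wise router sees a mean-pooled sequence representation,
    # so every token in the sequence shares one routing decision.
    pooled = [sum(t[j] for t in tokens) / len(tokens) for j in range(d)]
    task_gates = task_moe.gate(pooled)
    outputs = []
    for t in tokens:
        token_out = token_moe.apply(t, token_moe.gate(t))  # routed per token
        task_out = task_moe.apply(t, task_gates)           # routed per sequence
        outputs.append(token_out + task_out)               # list concatenation
    return outputs


rng = random.Random(0)
d_model = 8
token_moe = MoE(d_model, d_model // 2, n_experts=4, top_k=2, rng=rng)
task_moe = MoE(d_model, d_model // 2, n_experts=4, top_k=1, rng=rng)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(6)]
out = parallel_moe_layer(tokens, token_moe, task_moe)
print(len(out), len(out[0]))  # 6 8
```

The point of the sketch is the difference in routing granularity: the token-wise branch makes a separate routing decision for each state, action, or reward token, while the task-wise branch makes one decision for the whole sequence.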

Paper Structure

This paper contains 33 sections, 3 theorems, 22 equations, 13 figures, 17 tables, 4 algorithms.

Key Result

Theorem 1

Let $\mathcal{M}$ denote a set of tasks following the task distribution $P(M)$, and $|\mathcal{M}|\!=\!N$. $M\!\in\! \mathcal{M}$ is a given task. Let $\bar{h}\!=\!f(\tau)$, $z\!\sim\! G_{\text{task}}(\cdot|\bar{h})$, and $e(\bar{h},z)\!=\!\frac{p(z|\bar{h})}{p(z)}$, where $\tau$ is a trajectory fro

Figures (13)

  • Figure 1: t-SNE visualization of expert assignments on Cheetah-Vel where tasks differ in target velocities. Left: token-wise MoE enables different experts to process tokens with distinct semantics. Right: task-wise MoE effectively manages a broad task distribution, where the difference between expert assignments is positively related to the difference between tasks.
  • Figure 2: Overview of T2MIR. (a) Overall pipeline: we substitute the feedforward layer in causal transformer blocks with two parallel MoE layers and concatenate their outputs to feed into the next layer. (b) Token-wise MoE: it automatically captures distinct semantic features within the multi-modal $(s,a,r)$ inputs, and uses $\mathcal{L}_\text{balance}$ as a regularization loss to prevent tokens from all modalities from collapsing onto a few experts. (c) Task-wise MoE: it assigns diverse tasks to specialized experts, and includes a contrastive learning loss $\mathcal{L}_\text{contrastive}$ to enhance task-wise routing via more precise capture of task-relevant information, where $\tau_i$ is the query and $\tau_{i^*} / \tau_{i'}$ are positive/negative keys.
  • Figure 3: Test return curves of two T2MIR implementations against baselines using Mixed datasets.
  • Figure 4: Ablation results of both T2MIR-AD and T2MIR-DPT architectures using Mixed datasets.
  • Figure 5: Test return curves of T2MIR against baselines using Medium-Expert and Medium datasets.
  • ...and 8 more figures
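
The two auxiliary losses named in the Figure 2 caption can be sketched as follows. The exact forms of $\mathcal{L}_\text{balance}$ and $\mathcal{L}_\text{contrastive}$ are not given in this summary, so both functions below are stand-ins built from well-known formulations: a Switch-Transformer-style load-balancing loss and an InfoNCE loss with one positive key and several negative keys. The function names, the temperature value, and the use of dot-product similarity are assumptions for illustration.

```python
import math


def load_balance_loss(router_probs):
    """Switch-style balance loss: n_experts * sum_i f_i * P_i.

    router_probs: one probability vector over experts per token.
    f_i is the fraction of tokens whose top-1 expert is i,
    P_i is the mean router probability assigned to expert i.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    counts = [0] * n_experts
    for p in router_probs:
        counts[max(range(n_experts), key=lambda i: p[i])] += 1
    frac = [c / n_tokens for c in counts]
    mean_prob = [sum(p[i] for p in router_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(f * m for f, m in zip(frac, mean_prob))


def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss for one query, one positive key, and a list of negative keys.

    Assumption: similarity is a plain dot product scaled by a temperature.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]  # equals -log softmax(logits)[0]


# Balanced routing scores lower than routing that collapses onto one expert.
balanced = load_balance_loss([[0.9, 0.1], [0.1, 0.9]])
collapsed = load_balance_loss([[0.9, 0.1], [0.9, 0.1]])

# A query aligned with its positive key scores lower than one aligned with a negative.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

In the caption's notation, `query` would play the role of the router representation of $\tau_i$, `positive` that of a trajectory $\tau_{i^*}$ from the same task, and `negatives` those of trajectories $\tau_{i'}$ from other tasks.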

Theorems & Definitions (5)

  • Theorem 1
  • Lemma 1
  • proof
  • Theorem 1
  • proof