
Phase-Aware Mixture of Experts for Agentic Reinforcement Learning

Shengtian Yang, Yu Li, Shuo He, Yewen Li, Qingpeng Cai, Peng Jiang, Lei Feng

TL;DR

The paper proposes a lightweight phase router that learns latent phase boundaries directly from the RL objective, without pre-defined phase categories, and routes all steps within a phase to the same expert so that experts preserve phase-specific expertise.

Abstract

Reinforcement learning (RL) has equipped LLM agents with a strong ability to solve complex tasks. However, existing RL methods normally use a single policy network, causing simplicity bias where simple tasks occupy most parameters and dominate gradient updates, leaving insufficient capacity for complex tasks. A plausible remedy could be employing the Mixture-of-Experts (MoE) architecture in the policy network, as MoE allows different parameters (experts) to specialize in different tasks, preventing simple tasks from dominating all parameters. However, a key limitation of traditional MoE is its token-level routing, where the router assigns each token to specialized experts, which fragments phase-consistent patterns into scattered expert assignments and thus undermines expert specialization. In this paper, we propose Phase-Aware Mixture of Experts (PA-MoE). It first features a lightweight phase router that learns latent phase boundaries directly from the RL objective without pre-defining phase categories. Then, the phase router allocates temporally consistent assignments to the same expert, allowing experts to preserve phase-specific expertise. Experimental results demonstrate the effectiveness of our proposed PA-MoE.

Paper Structure (33 sections, 15 equations, 5 figures, 12 tables, 1 algorithm)

Figures (5)

  • Figure 1: Single-Policy Networks vs. Our PA-MoE. (a) Single-policy networks exhibit severe simplicity bias: simple tasks (pick_and_place) occupy 75% of parameters while complex tasks (heat/cool/clean requiring multi-step tool interaction) receive only 5%. Parameter occupancy is measured as the fraction of training batches where each task category contributes more than 50% of the batch loss. (b) PA-MoE achieves balanced parameter allocation ($\sim$30% per expert) and uniformly high performance across all task difficulties (96%-92%-98%).
  • Figure 2: Routing Granularity Comparison. (a) Expert switches per episode across routing strategies. Token-level MoE causes excessive fragmentation (45 switches), while trajectory-level MoE is overly coarse (3 switches). PA-MoE strikes a balance with 8 switches per episode. (b) Visualization of phase-level expert assignment showing semantic coherence: the same expert (indicated by color) is maintained within each contiguous behavioral phase, with switches occurring only at phase boundaries.
  • Figure 3: PA-MoE Architecture. Upper panel: Phase-Aware Router (Sec. \ref{sec:method}) processes observation $o_t$ and goal $g$ via cross-attention, and action history $h_t$ via LSTM, to select expert $k^* = \arg\max p_i$. Balance loss $\mathcal{L}_{\text{bal}}$ ensures uniform expert utilization. Lower panel: Phase-Aware MoE Execution shows agent-environment interaction. The router selects expert $k^*$ from $K$ LoRA-based experts sharing a frozen base model (gray). The selected expert generates action via policy $\pi_{\text{exp}}^{k^*}(a_t | s_t, h_t, g)$. Training optimizes both router and experts jointly via RL loss $\mathcal{L}_{\text{RL}}$ and diversity loss $\mathcal{L}_{\text{div}}$ (Sec. \ref{sec:method}).
  • Figure 4: Gradient conflict analysis. (a) Phase-specific gradients projected onto the top-2 principal components show pairwise conflicts: Explore and Interact gradients form angles exceeding 90°, and no two phases produce aligned gradients. (b) Gradient conflict score throughout training, defined as the average negative cosine similarity between phase gradients. The single policy maintains high conflict ($>0.4$) while PA-MoE reduces conflict to near-zero by epoch 50. (c) Pareto analysis of simple vs. complex task performance. PA-MoE achieves Pareto optimality while the GiGPO baseline suffers degraded complex task performance.
  • Figure 5: Entropy mismatch analysis. All entropy values are reported in bits (log base 2). (a) Policy entropy across phases: optimal phase-specific entropy (blue), single policy entropy (coral), and PA-MoE entropy (green). (b) Action distribution modality for exploration (diffuse, $H{=}3.5$ bits), interaction (peaked, $H{=}0.5$ bits), and single policy (intermediate, $H{=}2.3$ bits). (c) Entropy evolution over episode timesteps. (d) Absolute entropy deviation from optimal by phase.
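
The captions above define four diagnostics in words: expert switches per episode (Figure 2), parameter occupancy (Figure 1), gradient conflict score (Figure 4), and policy entropy in bits (Figure 5). The following is a minimal sketch of how such quantities could be computed. It is not the authors' implementation. All function names and input layouts are hypothetical, and two details are assumptions: the conflict score is averaged over all unordered pairs of phase gradients with no clipping, and a task category "dominates" a batch only when its share of the loss is strictly above 50%.

```python
import math
from itertools import combinations


def count_expert_switches(assignments):
    """Count how often the selected expert changes between consecutive steps."""
    return sum(1 for prev, cur in zip(assignments, assignments[1:]) if prev != cur)


def parameter_occupancy(batch_losses):
    """Fraction of batches in which a task category contributes >50% of the loss.

    batch_losses: list of dicts mapping category name -> summed loss in a batch
    (hypothetical layout; the source does not specify how losses are stored).
    """
    counts = {}
    for losses in batch_losses:
        total = sum(losses.values())
        for category, loss in losses.items():
            if total > 0 and loss / total > 0.5:
                counts[category] = counts.get(category, 0) + 1
    return {category: n / len(batch_losses) for category, n in counts.items()}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gradient_conflict_score(phase_grads):
    """Average negative cosine similarity over all pairs of phase gradients.

    Assumption: every unordered pair is weighted equally and nothing is clipped,
    so aligned gradients contribute negative values to the score.
    """
    pairs = list(combinations(phase_grads, 2))
    return sum(-cosine(u, v) for u, v in pairs) / len(pairs)


def entropy_bits(probs):
    """Shannon entropy of a discrete distribution, in bits (log base 2)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


if __name__ == "__main__":
    # Toy episode: the expert changes three times (0 -> 1, 1 -> 2, 2 -> 0).
    print(count_expert_switches([0, 0, 1, 1, 1, 2, 0, 0]))
    # Opposed gradients give the maximum conflict score of 1.0.
    print(gradient_conflict_score([[1.0, 0.0], [-1.0, 0.0]]))
    # A uniform distribution over four actions has 2 bits of entropy.
    print(entropy_bits([0.25, 0.25, 0.25, 0.25]))
```

As a rough reading aid for Figure 5, a uniform distribution over $n$ actions has entropy $\log_2 n$ bits, so $H = 3.5$ bits corresponds to about $2^{3.5} \approx 11$ equally likely actions, while $H = 0.5$ bits indicates a distribution concentrated on essentially one action.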

Theorems & Definitions (1)

  • Definition 3.1: Behavioral Phase