Meta-RL Induces Exploration in Language Agents

Yulun Jiang, Liangze Jiang, Damien Teney, Michael Moor, Maria Brbic

TL;DR

This work tackles the challenge that RL-trained LLM agents struggle to explore effectively in multi-turn tasks. It introduces LaMer, a Meta-RL framework enabling cross-episode exploration and in-context policy adaptation via self-reflection, eliminating the need for gradient updates during test-time adaptation. Across Sokoban, MineSweeper, and Webshop, LaMer yields consistent improvements over prompting and RL baselines, with strong test-time scaling and better generalization to harder or unseen tasks. The results demonstrate that meta-learning can imbue language agents with robust exploration strategies, improving adaptation in novel environments and long-horizon decision-making tasks.

Abstract

Reinforcement learning (RL) has enabled the training of large language model (LLM) agents to interact with the environment and to solve multi-turn, long-horizon tasks. However, RL-trained agents often struggle in tasks that require active exploration and fail to adapt efficiently from trial-and-error experience. In this paper, we present LaMer, a general Meta-RL framework that enables LLM agents to actively explore and learn from environment feedback at test time. LaMer consists of two key components: (i) a cross-episode training framework that encourages exploration and long-term reward optimization; and (ii) in-context policy adaptation via reflection, allowing the agent to adapt its policy from task feedback signals without gradient updates. Experiments across diverse environments show that LaMer significantly improves performance over RL baselines, with 11%, 14%, and 19% performance gains on Sokoban, MineSweeper, and Webshop, respectively. Moreover, LaMer generalizes better than RL-trained agents to more challenging or previously unseen tasks. Overall, our results demonstrate that Meta-RL provides a principled approach to inducing exploration in language agents, enabling more robust adaptation to novel environments through learned exploration strategies.
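To make the cross-episode training idea more concrete, the following is a minimal sketch of how per-episode rewards might be turned into cross-episode returns using the trajectory discount factor $\gamma_{\text{traj}}$ named in the Figure 2 caption. The function name `cross_episode_returns` and the return-to-go form are assumptions made for illustration; the paper's exact objective is not reproduced on this page, so this should not be read as the authors' formula.

```python
from typing import List


def cross_episode_returns(episode_rewards: List[float], gamma_traj: float) -> List[float]:
    """Discounted return-to-go over a sequence of episodes on one task.

    ASSUMPTION: episode n is credited with
        G_n = R_n + gamma_traj * R_{n+1} + gamma_traj**2 * R_{n+2} + ...
    This is one plausible reading of "cross-episode credit assignment",
    not the paper's stated equation.
    """
    returns = [0.0] * len(episode_rewards)
    running = 0.0
    # Walk backwards so each episode sees the discounted sum of later rewards.
    for n in range(len(episode_rewards) - 1, -1, -1):
        running = episode_rewards[n] + gamma_traj * running
        returns[n] = running
    return returns


if __name__ == "__main__":
    # Two failed attempts followed by a success on the third episode.
    print(cross_episode_returns([0.0, 0.0, 1.0], gamma_traj=0.5))
```

Under this reading, a larger `gamma_traj` passes more credit from a later success back to the earlier, unsuccessful episodes that preceded it, while `gamma_traj = 0` scores each episode in isolation.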

Paper Structure

This paper contains 34 sections, 6 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Comparison of RL and Meta-RL training on the MineSweeper environment. Left: Meta-RL training with LaMer retains more of the base model's sample diversity while achieving higher success rates, reaching a better trade-off between exploration and exploitation. Right: Distinct trajectories and their empirical probabilities, aggregated over multiple sampled trajectories in the MineSweeper environment. Each trajectory corresponds to a sequence of clicks (numbered cells) on the board. Sample diversity is quantified by the entropy of the empirical distribution. The Meta-RL-trained model produces more diverse and explorative trajectories.
  • Figure 2: Comparison between the training processes of RL (top) and the Meta-RL used in LaMer (bottom). For a single task, RL generates a group of trajectories independently. In contrast, LaMer uses Meta-RL to produce the trajectories sequentially and adapts the policy in-context with self-reflection. The trajectory discount factor $\gamma_{\text{traj}}$ is used for cross-episode credit assignment.
  • Figure 3: Trajectory diversity of base and trained models. Compared to RL, Meta-RL preserves more diverse trajectories from the base model, striking a better balance between exploration and exploitation.
  • Figure 4: Performance of RL-trained and Meta-RL-trained models on tasks of increasing difficulty. For Sokoban, we gradually increase the number of boxes; for MineSweeper, we increase the number of mines in the grid.
  • Figure 5: Success rates of models trained with different $\gamma_{\text{traj}}$. A higher value encourages more exploration during training.
  • ...and 1 more figure
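The Figure 1 caption states that sample diversity is quantified by the entropy of the empirical distribution over trajectories. A minimal sketch of that measurement follows, assuming each trajectory is a sequence of hashable actions (for example, clicked cell indices) and that the natural logarithm is used. The function name `trajectory_entropy` and the choice of log base are assumptions for illustration, not details confirmed by the paper.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def trajectory_entropy(trajectories: Sequence[Sequence[Hashable]]) -> float:
    """Shannon entropy (in nats) of the empirical distribution over distinct trajectories.

    ASSUMPTION: two trajectories count as the same outcome only if their
    action sequences match exactly.
    """
    if not trajectories:
        return 0.0
    # Count how often each distinct action sequence was sampled.
    counts = Counter(tuple(t) for t in trajectories)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p)
    return entropy


if __name__ == "__main__":
    # Hypothetical samples: each inner list is a sequence of clicked cells.
    collapsed = [[3, 7, 1]] * 4
    diverse = [[3, 7, 1], [3, 7, 1], [5, 2, 8], [5, 2, 8]]
    print(trajectory_entropy(collapsed), trajectory_entropy(diverse))
```

A policy that always produces the same click sequence has entropy zero, and the value grows as probability mass spreads over more distinct trajectories, which is the sense in which the captions describe one model as more diverse than another.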