Meta-Reinforcement Learning with Self-Reflection for Agentic Search

Teng Xiao, Yige Yuan, Hamish Ivison, Huaisheng Zhu, Faeze Brahman, Nathan Lambert, Pradeep Dasigi, Noah A. Smith, Hannaneh Hajishirzi

Abstract

This paper introduces MR-Search, an in-context meta-reinforcement learning (RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single independent episode with sparse rewards, MR-Search trains a policy that conditions on past episodes and adapts its search strategy across episodes. MR-Search learns to learn a search strategy with self-reflection, allowing search agents to improve in-context exploration at test time. Specifically, MR-Search performs cross-episode exploration by generating explicit self-reflections after each episode and leveraging them as additional context to guide subsequent attempts, thereby promoting more effective exploration at test time. We further introduce a multi-turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine-grained credit assignment for each episode. Empirical results on eight benchmarks demonstrate the advantages of MR-Search over RL-based baselines, showing strong generalization and relative improvements of 9.2% to 19.3%. Our code and data are available at https://github.com/tengxiao1/MR-Search.
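
The abstract does not spell out how the dense turn-level relative advantage is computed. The sketch below shows one plausible reading, not the paper's algorithm: sample a group of meta-episodes for the same question, score the answer of every episode, and normalize rewards across the group at each episode index, so that each episode receives its own learning signal rather than sharing one sparse final reward. The function name `turn_level_relative_advantages`, the reward layout, and the mean/standard-deviation normalization are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def turn_level_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards across a sampled group, separately per episode index.

    rewards[i][n] is the reward of episode n in sampled meta-episode i
    (a hypothetical layout; every meta-episode has the same length here).
    """
    num_episodes = len(rewards[0])
    advantages = [[0.0] * num_episodes for _ in rewards]
    for n in range(num_episodes):
        # Compare only the n-th episodes of the group against each other.
        column = [row[n] for row in rewards]
        mu, sigma = mean(column), pstdev(column)
        for i, r in enumerate(column):
            advantages[i][n] = (r - mu) / (sigma + eps)
    return advantages


# Toy group: 4 sampled meta-episodes, 3 episodes each, binary correctness rewards.
group_rewards = [
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0],
]
adv = turn_level_relative_advantages(group_rewards)
for row in adv:
    print([round(a, 2) for a in row])
```

Under this assumed scheme, a meta-episode that is already correct in its first episode gets a positive advantage there, while one that only becomes correct after reflection is credited at the later episode index.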

Paper Structure

This paper contains 33 sections, 10 equations, 8 figures, 7 tables, and 1 algorithm.

Figures (8)

  • Figure 1: RL-based agents (a) condition solely on the current episode, and episodes are independent, whereas meta-RL-based agents (b) leverage context accumulated across episodes. MR-Search performs sequential self-reflection over past episodes to guide exploration in subsequent episodes. MR-Search runs inner episodes, each consisting of at most $T$ interaction steps and an answer. A sequence of $N$ episodes forms a meta-episode.
  • Figure 2: An overview of our proposed MR-Search framework. Given a question, the agent first completes an initial episode by interleaving reasoning and tool calls. It then enters an iterative self-reflection loop, where previous episodes serve as experience to inform subsequent searches and answer revisions, enabling iterative improvement across episodes.
  • Figure 3: We evaluate MR-Search, Search-R1 with sequential reflection inference (Search-R1-S), and Search-R1 with parallel sampling (Search-R1-P), selecting the most frequent answer among the generated trajectories. Shaded regions show the standard deviation across 3 runs. We observe that MR-Search achieves the best performance. See § \ref{sec:analysis} for details.
  • Figure 4: Test performance, training curves of reward and search frequency on ASearcher, evaluated with Qwen2.5-7B-Base. Additional results are provided in Appendix \ref{appendix:training_dynamic}.
  • Figure 5: Comparison of MR-Search, Search-R1 with sequential reflection turns (Search-R1-S), and Search-R1 with parallel sampling (Search-R1-P), selecting the most frequent answer among the generated trajectories.
  • ...and 3 more figures
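
The structure described in the captions of Figures 1 and 2 (inner episodes of at most $T$ interaction steps, $N$ episodes per meta-episode, a reflection after each episode) can be sketched as a simple control loop. This is a minimal illustration, not the authors' implementation: `Episode`, `run_meta_episode`, `toy_act`, `toy_reflect`, and the `"search:"` / `"answer:"` action strings are hypothetical names, and the toy policy stands in for a language model with a search tool. The `most_frequent_answer` helper mirrors the answer selection mentioned in the captions of Figures 3 and 5.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    steps: List[str]
    answer: str
    reflection: str = ""


def run_meta_episode(question, act, reflect, num_episodes, max_steps):
    """Run N episodes in sequence; each one sees all earlier episodes."""
    history: List[Episode] = []
    for _ in range(num_episodes):
        steps: List[str] = []
        answer = ""
        for _ in range(max_steps):  # at most T interaction steps
            action = act(question, history, steps)
            steps.append(action)
            if action.startswith("answer:"):
                answer = action[len("answer:"):].strip()
                break
        episode = Episode(steps=steps, answer=answer)
        # The reflection becomes extra context for the next episode.
        episode.reflection = reflect(question, history, episode)
        history.append(episode)
    return history


def most_frequent_answer(answers):
    """Pick the most common answer; ties go to the earliest one seen."""
    return Counter(answers).most_common(1)[0][0]


# Toy stand-ins for the policy (assumptions, not a real model or search API).
def toy_act(question, history, steps):
    if not steps:
        return "search: " + question
    # The first attempt is wrong; later attempts use the accumulated context.
    return "answer: Paris" if history else "answer: Lyon"


def toy_reflect(question, history, episode):
    return f"Attempt {len(history) + 1} answered {episode.answer!r}; re-check the evidence."


meta_episode = run_meta_episode(
    "capital of France?", toy_act, toy_reflect, num_episodes=3, max_steps=4
)
answers = [ep.answer for ep in meta_episode]
print(answers)                        # ['Lyon', 'Paris', 'Paris']
print(most_frequent_answer(answers))  # Paris
```

Passing the full `history` into both `act` and `reflect` is the point of the sketch: it is what distinguishes the meta-episode setting from running $N$ independent episodes.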