Reinforcement Learning for Long-Horizon Multi-Turn Search Agents

Vivek Kalyan, Martin Andrews

TL;DR

This work demonstrates that reinforcement learning can significantly enhance long-horizon, multi-turn search by training LLM-based agents to effectively utilize tools for legal document retrieval. Using a 14B model with LoRA adapters and a Gemini 2.5 Pro reward model, the authors show RL-trained agents outperform frontier API-only models on a legal benchmark, particularly as turn horizons grow. They systematically study turn-restricted inference and training, revealing that longer multi-turn horizons enable more sophisticated exploration strategies, while restrictive training impedes learning due to insufficient positive feedback for GRPO. The findings highlight RL as a practical approach to optimize iterative search with tools, with implications for complex, multi-turn information retrieval beyond the legal domain.
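The point about insufficient positive feedback can be illustrated with a minimal sketch of group-relative advantage normalization, the core idea behind GRPO. The binary rewards, the group size of four, the `eps` constant, and the function name `group_relative_advantages` are assumptions chosen for illustration, not details from the paper. When every rollout in a group fails, all advantages are zero and the group contributes no learning signal.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and spread.

    Hypothetical helper: the group size and eps value are illustrative
    assumptions, not settings reported in the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Every rollout fails (e.g. the turn budget is too tight to find the answer):
# all advantages are zero, so this group yields no gradient signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One rollout succeeds: it gets a positive advantage, the rest negative.
one_success = group_relative_advantages([0.0, 0.0, 1.0, 0.0])

print(all_failed)
print(one_success)
```

Under these assumptions, a training regime that rarely produces a successful rollout leaves most groups looking like `all_failed`, which is consistent with the observation that restrictive training impedes learning.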

Abstract

Large Language Model (LLM) agents can leverage multiple turns and tools to solve complex tasks, with prompt-based approaches achieving strong performance. This work demonstrates that Reinforcement Learning (RL) can push capabilities significantly further by learning from experience. Through experiments on a legal document search benchmark, we show that our RL-trained 14-billion-parameter model outperforms frontier-class models (85% vs 78% accuracy). In addition, we explore turn-restricted regimes, both during training and at test time, and show that these agents achieve better results when allowed to operate over longer multi-turn horizons.

Paper Structure

This paper contains 17 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Performance of multi-turn agents under turn restrictions
  • Figure 2: Effect of restricting turns during RL training