Reinforcement Learning for Long-Horizon Multi-Turn Search Agents
Vivek Kalyan, Martin Andrews
TL;DR
This work demonstrates that reinforcement learning (RL) can substantially improve long-horizon, multi-turn search by training LLM-based agents to use tools effectively for legal document retrieval. Using a 14B model with LoRA adapters and a Gemini 2.5 Pro reward model, the authors show that RL-trained agents outperform frontier API-only models on a legal benchmark (85% vs 78% accuracy), particularly as turn horizons grow. They systematically study turn-restricted inference and training, finding that longer multi-turn horizons enable more sophisticated exploration strategies, while turn-restricted training impedes learning because GRPO receives too little positive feedback. The findings highlight RL as a practical approach to optimizing iterative, tool-based search, with implications for complex, multi-turn information retrieval beyond the legal domain.
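To illustrate the point about GRPO and positive feedback: GRPO scores each rollout relative to the other rollouts sampled for the same prompt, so a group in which no rollout earns any reward yields no learning signal. The sketch below is illustrative Python, not the authors' training code. The function name, the binary reward values, and the `eps` constant are assumptions chosen for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    Hypothetical helper: subtract the group mean and divide by the group
    standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed binary rewards for a group of four rollouts on one query.
# One rollout found the right document: it gets a positive advantage,
# the others get negative ones, so the policy has something to learn from.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])

# No rollout succeeded (e.g. the turn budget was too tight): every
# advantage is zero and the group contributes no gradient signal.
starved = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Under these assumptions, a training regime in which most groups look like `starved` gives the optimizer very little to work with, which is consistent with the observation that restrictive turn limits during training impede learning.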
Abstract
Large Language Model (LLM) agents can leverage multiple turns and tools to solve complex tasks, with prompt-based approaches achieving strong performance. This work demonstrates that Reinforcement Learning (RL) can push capabilities significantly further by learning from experience. Through experiments on a legal document search benchmark, we show that our RL-trained 14-billion-parameter model outperforms frontier-class models (85% vs 78% accuracy). In addition, we explore turn-restricted regimes, both during training and at test time, which show that these agents achieve better results when allowed to operate over longer multi-turn horizons.
