Beyond the limitation of a single query: Train your LLM for query expansion with Reinforcement Learning

Shu Zhao, Tan Yu, Anbang Xu

TL;DR

This work tackles the limitation of single-query retrieval in reasoning-augmented search by introducing ExpandSearch, a reinforcement learning framework that first expands the query surface to improve recall and then distills retrieved content with a squeezer to preserve reasoning-relevant information. By separating recall and precision optimization, the approach yields strong gains on seven QA benchmarks, notably on multi-hop tasks, with a 3B LLM reaching state-of-the-art accuracy and outperforming several larger baselines. The method relies on an end-to-end RL objective with verifiable rewards and a modular squeezer that can be swapped at inference, enabling flexible deployment across tasks and resource constraints. Overall, ExpandSearch demonstrates that learned query expansion combined with selective information synthesis substantially enhances retrieval-augmented reasoning and paves the way for more efficient, scalable search agents.

Abstract

Reasoning-augmented search agents, such as Search-R1, are trained to reason, search, and generate the final answer iteratively. Nevertheless, due to their limited capabilities in reasoning and search, their performance on multi-hop QA benchmarks remains far from satisfactory. To handle complex or compound queries, we train an LLM-based search agent with a native query-expansion capability through reinforcement learning. In each turn, our search agent proposes several query variants, which are searched simultaneously to cover more relevant information. However, given limited post-training data and computing resources, it is very challenging for a single search agent to master multiple tasks at once, including query generation, understanding of retrieved information, and answer generation. Therefore, we propose incorporating a pre-trained squeezer model that helps the search agent understand the retrieved documents, allowing the search agent to focus on query generation for high retrieval recall. With the assistance of the squeezer model, we find that even a small-scale 3B LLM can demonstrate a strong capability of query expansion and achieve state-of-the-art accuracy on multi-hop QA benchmarks. Specifically, our experiments across seven question-answering benchmarks demonstrate that our method, named ExpandSearch, achieves an average improvement of 4.4% over state-of-the-art baselines, with strong gains on multi-hop reasoning tasks requiring diverse evidence aggregation.
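As a rough illustration of the expand-then-squeeze turn described in the abstract, the following minimal Python sketch pools the results of several query variants and passes them through a squeezer. It is not the authors' implementation: `expand_and_squeeze`, `toy_expand`, `toy_retrieve`, `toy_squeeze`, and `CORPUS` are hypothetical names, the retriever is a simple word-overlap ranker, and the squeezer merely concatenates chunks. In the paper, query expansion is performed by the RL-trained LLM and squeezing by a pre-trained LLM.

```python
from typing import Callable, List

# Hypothetical three-document corpus, used only for illustration.
CORPUS = [
    "Marie Curie was born in Warsaw.",
    "Warsaw is the capital of Poland.",
    "Paris is the capital of France.",
]


def toy_expand(question: str) -> List[str]:
    # Stand-in for the trained agent: returns fixed query variants.
    return ["Marie Curie birthplace born", "Warsaw capital of which country"]


def toy_retrieve(query: str, k: int) -> List[str]:
    # Stand-in retriever: rank documents by word overlap with the query.
    q_words = set(query.lower().split())
    scored = [
        (len(q_words & set(doc.lower().rstrip(".").split())), doc)
        for doc in CORPUS
    ]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:k] if score > 0]


def toy_squeeze(question: str, chunks: List[str]) -> str:
    # Stand-in squeezer: concatenates chunks instead of distilling them.
    return " ".join(chunks)


def expand_and_squeeze(
    question: str,
    expand: Callable[[str], List[str]],
    retrieve: Callable[[str, int], List[str]],
    squeeze: Callable[[str, List[str]], str],
    k: int = 3,
) -> str:
    """One search turn: expand the query, pool retrievals, then squeeze."""
    seen, pooled = set(), []
    for variant in expand(question):
        for chunk in retrieve(variant, k):
            if chunk not in seen:  # drop chunks returned by several variants
                seen.add(chunk)
                pooled.append(chunk)
    return squeeze(question, pooled)


if __name__ == "__main__":
    question = "Which country's capital is Marie Curie's birthplace?"
    print(expand_and_squeeze(question, toy_expand, toy_retrieve, toy_squeeze, k=1))
```

Because the squeezer is passed in as a plain callable, a different one can be substituted at call time without touching the expansion or retrieval steps, which loosely mirrors the modular design the summary describes.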

Paper Structure

This paper contains 23 sections, 3 equations, 5 figures, 9 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison between Search-R1 and the proposed ExpandSearch. Previous methods (a) miss critical information when using single queries and suffer from context overload. Our approach (b) generates multiple query variants to achieve comprehensive coverage while using a squeezer model to distill retrieved content to only reasoning-relevant information.
  • Figure 2: Impact of query expansion count ($n$) on EM accuracy across benchmarks. Increasing the number of rephrased queries from $1$ to $3$ yields substantial EM accuracy gains.
  • Figure 3: Effect of retrieval depth ($k$) on EM accuracy. Increasing $k$ from $3$ to $10$ chunks allows the model to cover more information from the expanded queries, yielding consistent improvements across benchmarks.
  • Figure 4: Training dynamics showing reward, response length, and search frequency. The synchronized growth of reward, response length, and search frequency reveals that the model autonomously learns expanded retrieval as an optimal strategy. Smooth training curves indicate stable optimization without collapse.
  • Figure 5: Performance comparison using different squeezers at inference time. Replacing the training-time squeezer (LLaMA-4-17b) with LLaMA-3.1-8b during inference maintains comparable performance, validating the modular architecture.
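
The figures above report EM (exact match) accuracy. The listing does not spell out how answers are normalized before comparison, so the sketch below assumes the common SQuAD-style convention (lowercasing, stripping punctuation and English articles, collapsing whitespace). The function names `normalize_answer`, `exact_match`, and `em_accuracy` are illustrative, not taken from the paper.

```python
import re
import string
from typing import List


def normalize_answer(text: str) -> str:
    # Assumed SQuAD-style normalization; the paper's exact rules may differ.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    # A prediction counts as correct if it matches any gold answer.
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_accuracy(predictions: List[str], references: List[List[str]]) -> float:
    # Fraction of questions whose prediction exactly matches a gold answer.
    hits = sum(exact_match(p, golds) for p, golds in zip(predictions, references))
    return hits / len(predictions)


if __name__ == "__main__":
    preds = ["The Warsaw.", "Berlin"]
    golds = [["Warsaw"], ["Munich"]]
    print(em_accuracy(preds, golds))
```

A binary check of this kind is also one plausible form of the "verifiable reward" mentioned in the TL;DR, although the summary does not confirm which reward function is used in training.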