Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval

Sheryl Hsu, Omar Khattab, Chelsea Finn, Archit Sharma

TL;DR

LeReT (Learning to Retrieve by Trying) is a reinforcement learning framework that diversifies query generation through prompt-driven exploration and trains the query generator with preference-based optimization (IPO) using per-hop rewards. By combining prompt diversification, context distillation, and greedy per-hop updates, LeReT improves multi-hop retrieval and downstream grounding on HotpotQA and HoVer, with absolute gains of up to 29% in retrieval accuracy and notable downstream improvements for stronger LLM generators. The method works across retrievers and supports iterative training to further boost performance, making it a practical, general approach to improving grounding in retrieval-augmented LLM systems. The work also argues that high-quality exploration data is crucial for successful RL in agentic pipelines and points to future extensions with indirect supervision and tool training.

Abstract

The hallucinations of large language models (LLMs) are increasingly mitigated by allowing LLMs to search for information and to ground their answers in real sources. Unfortunately, LLMs often struggle with posing the right search queries, especially when dealing with complex or otherwise indirect topics. Observing that LLMs can learn to search for relevant facts by trying different queries and learning to up-weight queries that successfully produce relevant results, we introduce Learning to Retrieve by Trying (LeReT), a reinforcement learning framework that explores search queries and uses preference-based optimization to improve their quality. LeReT can improve the absolute retrieval accuracy by up to 29% and the downstream generator evaluations by 17%. The simplicity and flexibility of LeReT allow it to be applied to arbitrary off-the-shelf retrievers and make it a promising technique for improving general LLM pipelines. Project website: http://sherylhsu.com/LeReT/.
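
The idea of trying several queries and up-weighting the ones that retrieve relevant results can be illustrated with a minimal sketch. It is not the authors' implementation: the reward (fraction of gold documents retrieved), the pairing rule (every two candidates with different rewards form a pair), and the names `retrieval_reward` and `build_preference_pairs` are assumptions chosen for illustration.

```python
from itertools import combinations


def retrieval_reward(retrieved, gold):
    """Assumed reward: fraction of gold documents found in the retrieved set."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    return len(set(retrieved) & gold_set) / len(gold_set)


def build_preference_pairs(candidates, gold, min_gap=0.0):
    """Turn sampled queries for one hop into (chosen, rejected) pairs.

    candidates: list of (query, retrieved_doc_ids) tuples, one per sampled query.
    gold: ids of the documents that should be retrieved for this question.
    min_gap: hypothetical threshold; pairs need a reward gap larger than this.
    """
    scored = [(query, retrieval_reward(docs, gold)) for query, docs in candidates]
    pairs = []
    for (query_a, reward_a), (query_b, reward_b) in combinations(scored, 2):
        if abs(reward_a - reward_b) > min_gap:
            if reward_a > reward_b:
                pairs.append({"chosen": query_a, "rejected": query_b})
            else:
                pairs.append({"chosen": query_b, "rejected": query_a})
    return pairs


# Toy example: three sampled queries for a question with gold documents "a" and "b".
candidates = [("q1", ["a", "b"]), ("q2", ["a"]), ("q3", ["c"])]
pairs = build_preference_pairs(candidates, gold=["a", "b"])
tied = build_preference_pairs([("q1", ["a"]), ("q2", ["a"])], gold=["a", "b"])
```

Under these assumptions, queries with equal reward produce no pair, so the resulting dataset only contains comparisons that carry a preference signal for the optimizer.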

Paper Structure

This paper contains 22 sections, 3 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: LeReT significantly improves retrieval and generation. LeReT provides a reinforcement-learning-based framework for improving the grounding and performance of LLM-generated answers by improving the retrieval of relevant factual data.
  • Figure 2: An overview of the standard multi-hop retrieval pipeline we study in this work. A user poses a question to the system. In each hop, the LLM generates search queries for the retriever and receives a collection of documents. The overall set of retrieved documents and the user question are then given to a downstream LLM for grounded answer generation.
  • Figure 3: Overview of prompt-driven diverse sampling and data generation. LeReT induces diverse but effective search queries by bootstrapping several few-shot prompts for query generation and uses the retrieval reward to collect preferred and dispreferred queries for each question's hop.
  • Figure 4: The model performance saturates quickly. Test performance of Llama 3 8B as training progresses on the preference dataset collected using LeReT on the full HotpotQA train set (90,447 HotpotQA questions, 494,208 preference pairs).
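
The multi-hop pipeline described in the Figure 2 caption can be sketched as a short loop. This is an illustrative sketch only: `generate_query` and `retrieve` are hypothetical callables standing in for the LLM query generator and an off-the-shelf retriever, the default of two hops is an assumption, and the toy corpus is made up for the example.

```python
from typing import Callable, List

QueryFn = Callable[[str, List[str]], str]   # (question, documents so far) -> query
RetrieveFn = Callable[[str], List[str]]     # query -> retrieved documents


def multi_hop_retrieve(
    question: str,
    generate_query: QueryFn,
    retrieve: RetrieveFn,
    num_hops: int = 2,  # assumed hop count for illustration
) -> List[str]:
    """Run several hops, accumulating the documents retrieved in each one."""
    documents: List[str] = []
    for _ in range(num_hops):
        # The query generator sees the question and everything retrieved so far.
        query = generate_query(question, documents)
        for doc in retrieve(query):
            if doc not in documents:
                documents.append(doc)
    # The full set would then be passed, with the question, to a downstream LLM.
    return documents


# Toy stand-ins (hypothetical) so the sketch runs without an LLM or a search index.
CORPUS = {
    "film": "The film was directed by Jane Doe.",
    "jane doe": "Jane Doe was born in Oslo.",
}


def toy_generate_query(question: str, documents: List[str]) -> str:
    return "film" if not documents else "jane doe"


def toy_retrieve(query: str) -> List[str]:
    return [CORPUS[query]] if query in CORPUS else []


docs = multi_hop_retrieve(
    "Where was the director of the film born?", toy_generate_query, toy_retrieve
)
```

In this framing, the query generator is the component that gets trained, while the retriever and the downstream answer generator are treated as fixed black boxes.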