FrugalRAG: Less is More in RL Finetuning for Multi-Hop Question Answering

Abhinav Java, Srivathsan Koundinyan, Nagarajan Natarajan, Amit Sharma

TL;DR

This paper proposes FrugalRAG, a two-stage finetuning framework that adaptively reduces the number of retrieval steps based on a question's difficulty. It attains state-of-the-art efficiency-accuracy tradeoffs, cutting retrieval cost nearly in half.

Abstract

Reinforcement learning (RL) based on the final answer's reward has driven recent progress in small language models (SLMs) on reasoning-heavy tasks such as math and code. However, applying the same techniques to retrieval-augmented generation (RAG) benchmarks like multi-hop QA has yielded limited gains, often trailing supervised or prompting-only baselines. Instead, we argue that a viable path for RL in multi-hop QA is to use test-time scaling judiciously to optimize both final answer accuracy and efficiency in reaching that answer. We propose FrugalRAG, a two-stage finetuning framework that adaptively reduces the number of retrieval steps based on a question's difficulty. First, we train an SLM with supervised finetuning on a full-exploration policy that generates broad sub-queries. Then, we apply RL to adaptively prune search depth based on question difficulty, directly rewarding policies that balance correctness with frugality. Unlike prior approaches requiring 10x more data, our method achieves competitive performance with only approximately 1,000 examples. On HotPotQA and other multi-hop QA benchmarks, FrugalRAG attains state-of-the-art efficiency-accuracy tradeoffs, cutting retrieval cost nearly in half. Moreover, on the challenging BrowseCompPlus benchmark, it generalizes zero-shot and surpasses SLM-based and other baselines. These results demonstrate the use of RL not to increase reasoning steps, but to reduce them, as an effective solution for scalable and efficient RAG.
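
To make the idea of "rewarding policies that balance correctness with frugality" concrete, here is a minimal Python sketch of one possible reward shape. It is not the reward defined in the paper: the function name `frugal_reward`, the linear penalty, and the `penalty_weight` parameter are all assumptions chosen for illustration. It only shows the general pattern of giving credit for a correct answer and discounting that credit by the fraction of the search budget used.

```python
def frugal_reward(
    is_correct: bool,
    num_searches: int,
    max_budget: int,
    penalty_weight: float = 0.5,  # hypothetical knob, not from the paper
) -> float:
    """Illustrative reward: correct answers earn more when they use fewer searches.

    Assumption: a linear discount on the fraction of the budget consumed.
    The actual FrugalRAG reward may differ.
    """
    if max_budget <= 0:
        raise ValueError("max_budget must be positive")
    if not 0 <= num_searches <= max_budget:
        raise ValueError("num_searches must lie in [0, max_budget]")

    # An incorrect answer earns nothing, however few searches it used,
    # so the policy cannot gain reward by simply stopping early.
    if not is_correct:
        return 0.0

    used_fraction = num_searches / max_budget
    return 1.0 - penalty_weight * used_fraction
```

Under this sketch, a correct answer that uses half the budget scores higher than a correct answer that uses all of it, while an incorrect answer scores zero in both cases. Gating the frugality bonus on correctness is one simple way to keep the policy from trading accuracy for fewer searches.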

Paper Structure

This paper contains 23 sections, 3 equations, 3 figures, 15 tables, 1 algorithm.

Figures (3)

  • Figure 1: On average, FrugalRAG outperforms fixed-budget and SFT-based baselines, demonstrating the effectiveness of both Stage-1 finetuning and Stage-2 learning to control test-time compute. The plots show the tradeoff metrics (see Table \ref{tab:sft_comparison} for detailed results).
  • Figure 2: FrugalRAG tested with different maximum budgets $B$ on HotPotQA. The approach achieves maximum recall at the training budget, followed by diminishing returns on subsequent searches.
  • Figure 3: FrugalRAG trained with a varying number of supervised examples. FrugalRAG is robust even in low-data regimes, and performance improves consistently as the number of examples increases.
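
The maximum budget $B$ in Figure 2 can be pictured as a cap on an iterative search loop that may also stop early. The sketch below is a hypothetical illustration, not the paper's algorithm: `retrieve_with_budget` and the three callables `generate_query`, `search`, and `should_stop` are assumed names standing in for the model's sub-query generation, the retriever, and the learned stopping decision.

```python
from typing import Callable, List, Tuple


def retrieve_with_budget(
    question: str,
    generate_query: Callable[[str, List[str]], str],  # hypothetical: next sub-query
    search: Callable[[str], List[str]],               # hypothetical: retriever call
    should_stop: Callable[[str, List[str]], bool],    # hypothetical: stop decision
    max_budget: int,
) -> Tuple[List[str], int]:
    """Run at most `max_budget` searches, stopping early when the policy says so.

    Returns the collected evidence and the number of searches actually used.
    """
    evidence: List[str] = []
    steps = 0
    while steps < max_budget:
        # Easy questions can exit before the budget is exhausted.
        if should_stop(question, evidence):
            break
        query = generate_query(question, evidence)
        evidence.extend(search(query))
        steps += 1
    return evidence, steps
```

In this picture, raising `max_budget` beyond what the stopping decision needs does not change the number of searches issued, which is one intuitive way to read the diminishing returns reported in Figure 2.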