Curriculum Guided Reinforcement Learning for Efficient Multi Hop Retrieval Augmented Generation

Yuelyu Ji, Rui Meng, Zhuochun Li, Daqing He

TL;DR

This work targets the inefficiencies and hallucination risks in multi-hop retrieval-augmented generation by introducing EVO-RAG, a two-stage curriculum-guided RL framework. EVO-RAG combines a seven-dimensional step-level reward with a time-based scheduler that shifts the agent from broad exploration in the Discovery stage to concise, evidence-backed refinement in the Refinement stage; the agent is trained via Direct Preference Optimization over a multi-head preference model. Across four multi-hop QA benchmarks, it improves Exact Match by up to 4.6 points over strong RAG baselines, also raises F1, and cuts average retrieval depth by 15% while reducing query waste. Ablations confirm the value of curriculum staging and dynamic reward scheduling. The approach offers a general recipe for reliable, cost-efficient multi-hop RAG systems and suggests avenues for adaptive reward structures and broader applicability beyond QA.

Abstract

Retrieval-augmented generation (RAG) grounds large language models (LLMs) in up-to-date external evidence, yet existing multi-hop RAG pipelines still issue redundant subqueries, explore too shallowly, or wander through overly long search chains. We introduce EVO-RAG, a curriculum-guided reinforcement learning framework that evolves a query-rewriting agent from broad early-stage exploration to concise late-stage refinement. EVO-RAG couples a seven-factor, step-level reward vector (covering relevance, redundancy, efficiency, and answer correctness) with a time-varying scheduler that reweights these signals as the episode unfolds. The agent is trained with Direct Preference Optimization over a multi-head reward model, enabling it to learn when to search, backtrack, answer, or refuse. Across four multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle), EVO-RAG boosts Exact Match by up to 4.6 points over strong RAG baselines while trimming average retrieval depth by 15%. Ablation studies confirm the complementary roles of curriculum staging and dynamic reward scheduling. EVO-RAG thus offers a general recipe for building reliable, cost-effective multi-hop RAG systems.
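
To make the idea of a time-varying reward scheduler concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear interpolation between a "discovery" and a "refinement" weight set as the episode progresses. The component names, the numeric weights, and the functions `scheduled_weights` and `scalar_reward` are all hypothetical, and only four illustrative components are shown rather than the paper's seven.

```python
# Hypothetical per-stage weights; the source gives no numeric values.
# Only four illustrative components are used, not the paper's seven.
DISCOVERY = {"relevance": 1.0, "redundancy": 0.2, "efficiency": 0.1, "correctness": 0.5}
REFINEMENT = {"relevance": 0.5, "redundancy": 1.0, "efficiency": 1.0, "correctness": 1.0}


def scheduled_weights(step: int, max_steps: int) -> dict[str, float]:
    """Linearly interpolate from DISCOVERY to REFINEMENT weights over an episode.

    Linear interpolation is an assumption made for illustration only.
    """
    if max_steps <= 1:
        frac = 1.0
    else:
        # Fraction of the episode elapsed, clamped to [0, 1].
        frac = min(max(step / (max_steps - 1), 0.0), 1.0)
    return {k: (1.0 - frac) * DISCOVERY[k] + frac * REFINEMENT[k] for k in DISCOVERY}


def scalar_reward(components: dict[str, float], step: int, max_steps: int) -> float:
    """Collapse a step-level reward vector into one scalar using scheduled weights.

    `components` holds signed scores (penalties are negative values).
    """
    weights = scheduled_weights(step, max_steps)
    return sum(weights[k] * components[k] for k in weights)
```

Under these assumed weights, a redundant sub-query is penalized only lightly at the first step and at full weight by the last step, which mirrors the shift from exploration to refinement described above.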

Paper Structure

This paper contains 45 sections, 12 equations, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: Illustration of EVO-RAG's two-stage curriculum. In the initial Discovery stage, the agent broadly explores multiple retrieval pathways to identify potentially relevant documents. Subsequently, in the Refinement stage, the agent fine-tunes queries to produce concise, evidence-backed answers.
  • Figure 2: The query rewriting agent (top) interacts with the environment through four high-level actions and observes retrieved evidence at each step. Seven reward signals (middle) provide dense step-level feedback based on relevance, redundancy, efficiency, and final correctness. These signals are used to train a multi-head preference model and update the agent policy via Direct Preference Optimization (DPO, bottom). A two-stage curriculum shifts weight from early exploration to late refinement.
  • Figure 3: Reward weights for EVO-RAG training. Stage 1 and Stage 2 represent the Discovery and Refinement phases, respectively. Arrows indicate weight trends.
  • Figure 4: Sub-query length (top) and step count (bottom) distributions under various reward configurations.
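
For readers unfamiliar with the policy update named in Figure 2, the standard Direct Preference Optimization objective is reproduced below as background. This is the generic DPO loss, not a formula taken from this paper. How EVO-RAG builds preference pairs from its multi-head reward model is not specified in this summary, so the symbols here (prompt $x$, preferred output $y_w$, dispreferred output $y_l$, reference policy $\pi_{\mathrm{ref}}$, temperature $\beta$) follow the usual DPO convention and are assumptions with respect to EVO-RAG.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the policy's likelihood of the preferred output relative to the reference policy and lowers it for the dispreferred one, without training a separate value function.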