Curriculum Guided Reinforcement Learning for Efficient Multi Hop Retrieval Augmented Generation
Yuelyu Ji, Rui Meng, Zhuochun Li, Daqing He
TL;DR
This work targets the inefficiencies and hallucination risks in multi-hop retrieval-augmented generation by introducing EVO-RAG, a two-stage curriculum-guided RL framework. EVO-RAG pairs a seven-dimensional step-level reward with a time-based scheduler that shifts the agent from broad exploration in the discovery stage to concise, evidence-backed refinement; the agent is trained with Direct Preference Optimization over a multi-head preference model. Across four multi-hop QA benchmarks, it improves Exact Match (by up to 4.6 points over strong RAG baselines) and F1 while reducing average retrieval depth by 15% and cutting query waste; ablations confirm the value of curriculum staging and dynamic reward scheduling. The approach offers a general recipe for reliable, cost-efficient multi-hop RAG systems and suggests avenues for adaptive reward structures and broader applicability beyond QA.
Abstract
Retrieval-augmented generation (RAG) grounds large language models (LLMs) in up-to-date external evidence, yet existing multi-hop RAG pipelines still issue redundant subqueries, explore too shallowly, or wander through overly long search chains. We introduce EVO-RAG, a curriculum-guided reinforcement learning framework that evolves a query-rewriting agent from broad early-stage exploration to concise late-stage refinement. EVO-RAG couples a seven-factor, step-level reward vector (covering relevance, redundancy, efficiency, and answer correctness) with a time-varying scheduler that reweights these signals as the episode unfolds. The agent is trained with Direct Preference Optimization over a multi-head reward model, enabling it to learn when to search, backtrack, answer, or refuse. Across four multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle), EVO-RAG boosts Exact Match by up to 4.6 points over strong RAG baselines while trimming average retrieval depth by 15%. Ablation studies confirm the complementary roles of curriculum staging and dynamic reward scheduling. EVO-RAG thus offers a general recipe for building reliable, cost-effective multi-hop RAG systems.
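
To make the idea of a time-varying scheduler over a step-level reward vector concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a linear interpolation between two weight profiles as the episode progresses, and it uses four illustrative components named after the aspects the abstract lists rather than the paper's seven factors. The function names, the weight values, and the linear schedule are all hypothetical.

```python
from typing import Dict

# Hypothetical weight profiles; none of these values come from the paper.
# Early in the episode the agent is rewarded mainly for finding relevant evidence.
DISCOVERY_WEIGHTS: Dict[str, float] = {
    "relevance": 1.0, "redundancy": 0.2, "efficiency": 0.1, "correctness": 0.5,
}
# Late in the episode redundancy, efficiency, and answer correctness dominate.
REFINEMENT_WEIGHTS: Dict[str, float] = {
    "relevance": 0.5, "redundancy": 1.0, "efficiency": 1.0, "correctness": 1.0,
}


def scheduled_weights(step: int, max_steps: int) -> Dict[str, float]:
    """Linearly interpolate from discovery to refinement weights (assumed schedule)."""
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    # Progress is 0.0 at the first step and 1.0 at the last step.
    if max_steps == 1:
        progress = 0.0
    else:
        clamped = min(max(step, 0), max_steps - 1)
        progress = clamped / (max_steps - 1)
    return {
        name: (1.0 - progress) * DISCOVERY_WEIGHTS[name]
        + progress * REFINEMENT_WEIGHTS[name]
        for name in DISCOVERY_WEIGHTS
    }


def step_reward(components: Dict[str, float], step: int, max_steps: int) -> float:
    """Collapse a step-level reward vector into a scalar using the scheduled weights."""
    weights = scheduled_weights(step, max_steps)
    return sum(weights[name] * components[name] for name in weights)


# A relevant but repetitive subquery: signed scores, where negative values are penalties.
example = {"relevance": 0.8, "redundancy": -1.0, "efficiency": -0.1, "correctness": 0.0}
early = step_reward(example, step=0, max_steps=5)
late = step_reward(example, step=4, max_steps=5)
```

Under these assumed weights, the same repetitive subquery scores higher at the start of the episode than at the end, which mirrors the shift from broad exploration to concise refinement described above. In a preference-based setup such as DPO, scalar scores of this kind could be used to rank candidate actions into preferred and dispreferred pairs, though how EVO-RAG constructs its pairs is not specified in this section.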
