PAR$^2$-RAG: Planned Active Retrieval and Reasoning for Multi-Hop Question Answering

Xingyu Li, Rongguang Wang, Yuying Wang, Mengqing Guo, Chenyang Li, Tao Sheng, Sujith Ravi, Dan Roth

Abstract

Large language models (LLMs) remain brittle on multi-hop question answering (MHQA), where answering requires combining evidence across documents through retrieval and reasoning. Iterative retrieval systems can fail by locking onto an early low-recall trajectory and amplifying downstream errors, while planning-only approaches may produce static query sets that cannot adapt when intermediate evidence changes. We propose \textbf{Planned Active Retrieval and Reasoning RAG (PAR$^2$-RAG)}, a two-stage framework that separates \emph{coverage} from \emph{commitment}. PAR$^2$-RAG first performs breadth-first anchoring to build a high-recall evidence frontier, then applies depth-first refinement with evidence sufficiency control in an iterative loop. Across four MHQA benchmarks, PAR$^2$-RAG consistently outperforms existing state-of-the-art baselines; compared with IRCoT, it achieves up to \textbf{23.5\%} higher accuracy, with retrieval gains of up to \textbf{10.5\%} in NDCG.
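The two-stage control flow sketched in the abstract (breadth-first anchoring followed by an ESC-gated refinement loop) can be illustrated in miniature as follows. All helper names here (`plan_anchor_queries`, `retrieve`, `evidence_sufficient`, `refine_query`) are hypothetical toy stand-ins, not the paper's actual components:

```python
# Toy sketch of the PAR^2-RAG two-stage control flow: Stage 1 builds a
# broad evidence frontier; Stage 2 iteratively refines, gated by an
# evidence sufficiency check (ESC). Helper names are placeholders.

def plan_anchor_queries(question):
    # Stage-1 planner: decompose the question into broad anchor queries.
    return [question, f"background: {question}"]

def retrieve(query, corpus):
    # Toy retriever: passages sharing at least one word with the query.
    words = set(query.lower().split())
    return [p for p in corpus if words & set(p.lower().split())]

def evidence_sufficient(evidence, needed=2):
    # Toy ESC: treat "enough distinct passages" as sufficiency.
    return len(set(evidence)) >= needed

def refine_query(question, evidence):
    # Stage-2 refinement: target what is still missing (toy version).
    return f"{question} details"

def par2_rag_sketch(question, corpus, max_steps=5):
    # Stage 1: breadth-first anchoring builds a high-recall frontier.
    evidence = []
    for q in plan_anchor_queries(question):
        evidence.extend(retrieve(q, corpus))
    # Stage 2: depth-first refinement, stopping once the ESC passes.
    for _ in range(max_steps):
        if evidence_sufficient(evidence):
            break
        evidence.extend(retrieve(refine_query(question, evidence), corpus))
    return sorted(set(evidence))

corpus = ["paris is the capital of france", "france is in europe"]
print(par2_rag_sketch("capital of france", corpus))
```

The key design point the sketch captures is the separation of coverage from commitment: Stage 1 never stops early, while Stage 2 only issues further targeted retrievals when the sufficiency gate says the collected evidence is still inadequate.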

Paper Structure

This paper contains 33 sections, 5 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overall architecture of PAR$^2$-RAG. Stage 1 (Coverage Anchor) expands evidence breadth to build $C_{\text{anchor}}$, and Stage 2 (Iterative Chain) performs ESC-gated refinement to either continue targeted retrieval or stop with the final answer.
  • Figure 2: Step robustness results on answer quality; each panel compares IRCoT, Coverage Anchor, Iterative Chain, and PAR$^2$-RAG across step counts $\{3,5,7,10\}$.
  • Figure 3: Prompt for correctness judgment.
  • Figure 4: Prompt for planner.
  • Figure 5: Prompt for searcher.
  • ...and 1 more figure