REAP: Enhancing RAG with Recursive Evaluation and Adaptive Planning for Multi-Hop Question Answering
Yijie Zhu, Haojie Zhou, Wanting Hong, Tailin Liu, Ning Wang
TL;DR
The paper addresses two weaknesses of retrieval-augmented generation (RAG) on multi-hop question answering (MHQA): the lack of global planning and unreliable fact extraction. It introduces REAP, a dual-module framework comprising a Sub-task Planner (SP) and a Fact Extractor (FE) that iteratively refine a structured plan and a grounded set of facts, enabling more coherent and traceable reasoning. A unified multi-task fine-tuning strategy improves planning capabilities, particularly for the data-scarce Re-Planner, and an asymmetric modular design enhances efficiency. Extensive experiments across four MHQA benchmarks show that REAP achieves state-of-the-art performance in both in-domain and out-of-domain settings, significantly outperforming baselines including standard RAG and several iterative methods, while maintaining favorable efficiency and scalability characteristics.
Abstract
Retrieval-augmented generation (RAG) has been extensively employed to mitigate hallucinations in large language models (LLMs). However, existing methods for multi-hop reasoning tasks often lack global planning, increasing the risk of falling into local reasoning impasses. In addition, insufficient exploitation of retrieved content and neglect of latent clues undermine the accuracy of reasoning outcomes. To overcome these limitations, we propose Recursive Evaluation and Adaptive Planning (REAP), whose core idea is to explicitly maintain structured sub-tasks and facts related to the current task through the Sub-task Planner (SP) and Fact Extractor (FE) modules. SP maintains a global perspective, guiding the overall reasoning direction and evaluating the task state based on the outcomes of FE, enabling dynamic optimization of the task-solving trajectory. FE performs fine-grained analysis over retrieved content to extract reliable answers and clues. Together, these two modules incrementally enrich a logically coherent representation of global knowledge, enhancing the reliability and traceability of the reasoning process. Furthermore, we propose a unified task paradigm design that enables effective multi-task fine-tuning, significantly enhancing SP's performance on complex, data-scarce tasks. We conduct extensive experiments on multiple public multi-hop datasets, and the results demonstrate that our method significantly outperforms existing RAG methods in both in-domain and out-of-domain settings, validating its effectiveness in complex multi-hop reasoning tasks.
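The interplay between SP and FE described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`SubTask`, `State`, `reap_loop`, the `toy_*` functions, the `<t1>` placeholder convention, and the two-sentence corpus) is a hypothetical stand-in, and the fine-tuned planner and extractor models are replaced by hand-written rules purely to show the control flow of planning, extracting facts, and re-planning over a shared state.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SubTask:
    id: str
    question: str
    done: bool = False


@dataclass
class State:
    question: str
    plan: List[SubTask] = field(default_factory=list)
    facts: Dict[str, str] = field(default_factory=dict)  # sub-task id -> fact


def reap_loop(question, plan_fn, retrieve_fn, extract_fn, replan_fn,
              synthesize_fn, max_iters=5):
    """Alternate between fact extraction and plan revision (illustrative only)."""
    state = State(question, plan=plan_fn(question))
    for _ in range(max_iters):
        pending = [t for t in state.plan if not t.done]
        if not pending:
            break
        task = pending[0]
        passages = retrieve_fn(task.question)
        fact = extract_fn(task, passages)       # role of the Fact Extractor
        if fact is not None:
            state.facts[task.id] = fact
            task.done = True
        state.plan = replan_fn(state)           # role of the planner's re-planning
    return synthesize_fn(state), state


# --- Toy stand-ins for the model-backed components (all assumptions) ---
CORPUS = ["Film X was directed by Jane Doe.", "Jane Doe was born in Oslo."]
PATTERNS = {"t1": r"directed by (.+?)\.", "t2": r"born in (.+?)\."}


def toy_plan(question: str) -> List[SubTask]:
    # A real planner would decompose the question; this plan is hard-coded.
    return [SubTask("t1", "Who directed Film X?"),
            SubTask("t2", "Where was <t1> born?")]


def toy_retrieve(query: str) -> List[str]:
    # Keep passages that share at least two words with the query.
    q = set(re.findall(r"\w+", query.lower()))
    return [p for p in CORPUS
            if len(q & set(re.findall(r"\w+", p.lower()))) >= 2]


def toy_extract(task: SubTask, passages: List[str]) -> Optional[str]:
    # Return the first span matching the sub-task's pattern, else None.
    for passage in passages:
        match = re.search(PATTERNS[task.id], passage)
        if match:
            return match.group(1)
    return None


def toy_replan(state: State) -> List[SubTask]:
    # Substitute known facts into later sub-tasks that reference them.
    for task in state.plan:
        for tid, fact in state.facts.items():
            task.question = task.question.replace(f"<{tid}>", fact)
    return state.plan


def toy_synthesize(state: State) -> Optional[str]:
    # Treat the fact for the final sub-task as the answer.
    return state.facts.get(state.plan[-1].id)


answer, final_state = reap_loop("Where was the director of Film X born?",
                                toy_plan, toy_retrieve, toy_extract,
                                toy_replan, toy_synthesize)
print(answer)
```

In this sketch a failed extraction simply leaves the sub-task pending for another pass. A re-planner of the kind the abstract describes would presumably revise the plan in that case, which the toy `toy_replan` does not attempt.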
