Generate-then-Ground in Retrieval-Augmented Generation for Multi-hop Question Answering
Zhengliang Shi, Weiwei Sun, Shen Gao, Pengjie Ren, Zhumin Chen, Zhaochun Ren
TL;DR
The paper targets multi-hop question answering (MHQA) and identifies two limitations of the retrieve-then-read paradigm: dependence on retriever quality and noise in the retrieved documents. It introduces GenGround, a generate-then-ground framework in which an LLM alternates between answer deduction (formulating a simple sub-question and answering it directly) and instructional grounding (checking that answer against retrieved evidence and revising it if it is wrong), with a batch grounding strategy for efficiency. To extend the approach to smaller models, the authors propose Instructional Grounding Distillation (IGD), which uses a 50k single-hop synthetic dataset to distill grounding trajectories into a student model. Experiments on four MHQA benchmarks show GenGround outperforming strong baselines, and IGD lets smaller models approach or exceed those baselines.
Abstract
Multi-Hop Question Answering (MHQA) tasks present a significant challenge for large language models (LLMs) due to the intensive knowledge required. Current solutions, like Retrieval-Augmented Generation, typically retrieve potential documents from an external corpus to read an answer. However, the performance of this retrieve-then-read paradigm is constrained by the retriever and the inevitable noise in the retrieved documents. To mitigate these challenges, we introduce a novel generate-then-ground (GenGround) framework, synergizing the parametric knowledge of LLMs and external documents to solve a multi-hop question. GenGround empowers LLMs to alternate two phases until the final answer is derived: (1) formulate a simpler, single-hop question and directly generate the answer; (2) ground the question-answer pair in retrieved documents, amending any wrong predictions in the answer. We also propose an instructional grounding distillation method to generalize our method into smaller models. Extensive experiments conducted on four datasets illustrate the superiority of our method.
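The two alternating phases described in the abstract can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`generate_then_ground`, `deduce`, `retrieve`, `ground`), the `max_hops` limit, and the toy stubs are all hypothetical, and the LLM prompts, the retriever, and the batch grounding strategy are not reproduced here.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   deduce(question, history)   -> (sub_question or None, draft_answer)
#   retrieve(sub_question)      -> list of documents
#   ground(sub_q, draft, docs)  -> revised answer
History = List[Tuple[str, str]]
Deduce = Callable[[str, History], Tuple[Optional[str], str]]
Retrieve = Callable[[str], List[str]]
Ground = Callable[[str, str, List[str]], str]


def generate_then_ground(question: str, deduce: Deduce, retrieve: Retrieve,
                         ground: Ground, max_hops: int = 5) -> str:
    """Alternate answer deduction and grounding until a final answer is produced."""
    history: History = []
    for _ in range(max_hops):
        # Phase 1: formulate a single-hop sub-question and answer it directly.
        sub_q, draft = deduce(question, history)
        if sub_q is None:  # the deducer signals that `draft` is the final answer
            return draft
        # Phase 2: ground the (sub-question, answer) pair in retrieved documents.
        docs = retrieve(sub_q)
        revised = ground(sub_q, draft, docs)
        history.append((sub_q, revised))
    # Hop budget exhausted: fall back to the last grounded answer, if any.
    return history[-1][1] if history else ""


# Toy stubs standing in for the LLM and the retriever (purely illustrative).
def toy_deduce(question: str, history: History) -> Tuple[Optional[str], str]:
    if len(history) == 0:
        return "Who directed Film X?", "Alice"  # deliberately wrong draft
    if len(history) == 1:
        return f"Where was {history[0][1]} born?", "Paris"
    return None, history[-1][1]


def toy_retrieve(sub_q: str) -> List[str]:
    corpus = {
        "Who directed Film X?": ["Film X was directed by Bob."],
        "Where was Bob born?": ["Bob was born in Paris."],
    }
    return corpus.get(sub_q, [])


def toy_ground(sub_q: str, draft: str, docs: List[str]) -> str:
    # Keep the draft if any document supports it.
    if any(draft in doc for doc in docs):
        return draft
    # Toy revision rule: take the last word of the first document.
    if docs:
        return docs[0].rstrip(".").split()[-1]
    return draft  # no evidence retrieved: keep the generated answer


final = generate_then_ground(
    "Where was the director of Film X born?", toy_deduce, toy_retrieve, toy_ground
)
print(final)
```

In the toy run, the wrong first-hop draft ("Alice") is revised to "Bob" during grounding, so the second sub-question is formed from the corrected answer rather than the erroneous one.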
