Plan-and-Refine: Diverse and Comprehensive Retrieval-Augmented Generation
Alireza Salemi, Chris Samarinas, Hamed Zamani
TL;DR
Plan-and-Refine (P&R) tackles the dual challenges of diversity and factuality in retrieval-augmented generation by coupling a global planning phase, which yields diverse plan-based retrieval strategies, with a local refinement phase that iteratively edits response proposals. A planner, retriever, generator, editor, and reward model form a pipeline: each of several plans produces a candidate response, the candidates are refined, and they are then scored so that the best one can be selected under ICAT, a metric that combines factuality and coverage. Empirical results on ANTIQUE and TREC Web Track show that P&R outperforms strong baselines, with relative gains of up to 13.1% in ICAT-A and clear improvements in both coverage and factuality, supported by a human preference study. The work suggests that increasing plan diversity and applying staged refinement under a constrained retrieval budget yields more complete, accurate, and user-aligned long-form responses for information-seeking tasks.
Abstract
This paper studies the limitations of (retrieval-augmented) large language models (LLMs) in generating diverse and comprehensive responses, and introduces the Plan-and-Refine (P&R) framework, which is based on a two-phase system design. In the global exploration phase, P&R generates a diverse set of plans for the given input, where each plan consists of a list of diverse query aspects with corresponding additional descriptions. This phase is followed by a local exploitation phase that generates a response proposal for the input query conditioned on each plan and iteratively refines the proposal to improve its quality. Finally, a reward model is employed to select the proposal with the highest factuality and coverage. We conduct our experiments using the ICAT evaluation methodology, a recent approach for evaluating answer factuality and comprehensiveness. Experiments on two diverse information-seeking benchmarks, adopted from non-factoid question answering and TREC search result diversification tasks, demonstrate that P&R significantly outperforms baselines, achieving up to a 13.1% improvement on the ANTIQUE dataset and a 15.41% improvement on the TREC dataset. Furthermore, a smaller-scale user study confirms the substantial efficacy of the P&R framework.
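The control flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `plan_and_refine`, its parameters, the representation of a plan as a list of aspect strings, and the `toy_*` stand-ins are all hypothetical. In the real system each callable would presumably wrap an LLM or a trained reward model, and retrieval is assumed here to happen inside `generate`.

```python
from typing import Callable, List, Sequence

# Assumption: a plan is represented as a list of query aspects.
Plan = List[str]


def plan_and_refine(
    query: str,
    propose_plans: Callable[[str, int], Sequence[Plan]],
    generate: Callable[[str, Plan], str],
    refine: Callable[[str, Plan, str], str],
    reward: Callable[[str, str], float],
    num_plans: int = 4,
    refine_steps: int = 2,
) -> str:
    """Return the highest-reward proposal across all plans."""
    # Global exploration: obtain several diverse plans for the query.
    plans = propose_plans(query, num_plans)
    if not plans:
        raise ValueError("propose_plans returned no plans")

    proposals = []
    for plan in plans:
        # Local exploitation: draft a response conditioned on this plan,
        # then refine it for a fixed number of steps.
        response = generate(query, plan)
        for _ in range(refine_steps):
            response = refine(query, plan, response)
        proposals.append(response)

    # Selection: keep the proposal that the reward function scores highest.
    return max(proposals, key=lambda r: reward(query, r))


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def toy_plans(query: str, n: int) -> List[Plan]:
    aspects = ["history", "causes", "effects", "examples"]
    return [aspects[: i + 1] for i in range(n)]


def toy_generate(query: str, plan: Plan) -> str:
    return f"{query}: " + ", ".join(plan)


def toy_refine(query: str, plan: Plan, response: str) -> str:
    return response + "!"


def toy_reward(query: str, response: str) -> float:
    # Crude coverage proxy: more comma-separated aspects scores higher.
    return float(response.count(","))


best = plan_and_refine(
    "q", toy_plans, toy_generate, toy_refine, toy_reward,
    num_plans=3, refine_steps=2,
)
```

With the toy stand-ins, the plan covering the most aspects wins because the toy reward only counts aspects; the reward model described in the abstract instead targets both factuality and coverage.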
