Plan-and-Refine: Diverse and Comprehensive Retrieval-Augmented Generation

Alireza Salemi, Chris Samarinas, Hamed Zamani

TL;DR

Plan-and-Refine (P&R) tackles the dual challenges of diversity and factuality in retrieval-augmented generation by coupling a global planning phase that yields diverse, plan-based retrieval strategies with a local refinement phase that iteratively edits proposals. A planner, retriever, generator, editor, and reward model form a pipeline: each of several plans yields a candidate response, the candidates are refined, and the best one is selected by scoring factuality and coverage (ICAT). Empirical results on ANTIQUE and TREC Web Track demonstrate that P&R outperforms strong baselines, with up to 13.1% relative gains in ICAT-A and clear improvements in both coverage and factuality, supported by a human preference study. The work suggests that increasing plan diversity and applying staged refinement under a constrained retrieval budget yields more complete, accurate, and user-aligned long-form responses for information-seeking tasks.

Abstract

This paper studies the limitations of (retrieval-augmented) large language models (LLMs) in generating diverse and comprehensive responses, and introduces the Plan-and-Refine (P&R) framework, which is based on a two-phase system design. In the global exploration phase, P&R generates a diverse set of plans for the given input, where each plan consists of a list of diverse query aspects with corresponding additional descriptions. This phase is followed by a local exploitation phase that generates a response proposal for the input query conditioned on each plan and iteratively refines the proposal to improve its quality. Finally, a reward model is employed to select the proposal with the highest factuality and coverage. We conduct our experiments based on the ICAT evaluation methodology, a recent approach for evaluating answer factuality and comprehensiveness. Experiments on two diverse information-seeking benchmarks, adopted from non-factoid question answering and TREC search result diversification tasks, demonstrate that P&R significantly outperforms baselines, achieving up to a 13.1% improvement on the ANTIQUE dataset and a 15.41% improvement on the TREC dataset. Furthermore, a smaller-scale user study confirms the substantial efficacy of the P&R framework.
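
The two phases described in the abstract can be sketched as a short control loop. This is a minimal illustration, not the authors' implementation: the function name `plan_and_refine`, the `Plan` type, the per-aspect retrieval step, the fixed number of refinement steps, and all `toy_*` components are hypothetical stand-ins for the LLM-based planner, retriever, generator, editor, and reward model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Plan:
    # A plan is a list of (aspect, description) pairs, as in the abstract.
    aspects: List[Tuple[str, str]]


def plan_and_refine(query, planner, retriever, generator, editor, reward,
                    num_plans=3, refine_steps=2):
    """Hypothetical sketch: explore several plans, refine each proposal,
    and return the proposal the reward function scores highest."""
    best_response, best_score = None, float("-inf")
    # Global exploration: one candidate per plan.
    for plan in planner(query, num_plans):
        # Assumption: retrieval is issued once per plan aspect.
        docs = [d for aspect, _ in plan.aspects for d in retriever(query, aspect)]
        proposal = generator(query, plan, docs)
        # Local exploitation: iteratively edit the proposal.
        for _ in range(refine_steps):
            proposal = editor(query, plan, docs, proposal)
        # Selection: keep the highest-scoring proposal (first one wins ties).
        score = reward(query, proposal)
        if score > best_score:
            best_response, best_score = proposal, score
    return best_response, best_score


# Toy components so the sketch runs without any model or index.
def toy_planner(query, n):
    pool = ["history", "causes", "effects", "examples"]
    return [Plan([(a, f"{a} of {query}") for a in pool[: i + 1]]) for i in range(n)]


def toy_retriever(query, aspect):
    return [f"doc about {aspect}"]


def toy_generator(query, plan, docs):
    # The first draft covers only the first retrieved document.
    return docs[0]


def toy_editor(query, plan, docs, proposal):
    # Each edit adds the first document not yet covered, if any.
    for d in docs:
        if d not in proposal:
            return proposal + " | " + d
    return proposal


def toy_reward(query, response):
    # Stand-in for a factuality/coverage score: count covered documents.
    return float(response.count("doc about"))
```

With the toy components, plans that list more aspects retrieve more documents, so refinement has more to add and the reward favors the broader plan. In a real system each callable would wrap a model or search index, and the reward would approximate factuality and coverage rather than count substrings.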

Paper Structure

This paper contains 37 sections, 4 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: An overview of the P&R framework.
  • Figure 2: Effect of (A) threshold ($Z$) for planner training, (B) local/global steps, and (C) total budget on P&R's performance on ANTIQUE. Larger versions of the three panels appear in the appendix.
  • Figure 3: The prompt templates used with different components in the P&R framework.
  • Figure 4: The prompts used by the baselines.
  • Figure 5: Effect of generated plan selection threshold for self-training planner on performance for the ANTIQUE dataset.
  • ...and 3 more figures