Atomic Self-Consistency for Better Long Form Generations

Raghuveer Thirukovalluru, Yukun Huang, Bhuwan Dhingra

TL;DR

This work tackles the recall aspect of long-form QA by introducing Atomic Self-Consistency (ASC), a black-box method that merges authentic atomic facts drawn from multiple stochastic samples to produce a superior composite answer. ASC decomposes candidate generations into atomic facts, clusters and filters them by a consistency-driven score, and then synthesizes a final answer by summarizing the selected cluster representatives. Across ASQA, QAMPARI, QUEST, and ELI5, ASC consistently outperforms single-sample baselines and USC-based approaches, demonstrating improved recall without sacrificing precision and often achieving higher fluency. The paper also provides thorough ablations and entropy-based analyses, showing how to control the method with the $\Theta$ parameter and indicating substantial untapped potential for further gains by integrating ASC with additional verification strategies and optimized sampling budgets.
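The decompose, cluster, filter, and summarize steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: sentences stand in for atomic facts, token-level Jaccard overlap stands in for whatever similarity measure the paper uses, cluster strength is taken to be the number of distinct samples that support a cluster, `theta` is treated as a minimum-strength cutoff, and `summarize` is a placeholder for the final LLM summarization call. All function names and default values are hypothetical.

```python
import re
from typing import Dict, List, Set


def split_into_facts(sample: str) -> List[str]:
    """Assumption: one sentence approximates one atomic fact."""
    parts = re.split(r"(?<=[.!?])\s+", sample.strip())
    return [p for p in parts if p]


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a learned similarity."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def cluster_facts(samples: List[str], sim_threshold: float = 0.6) -> List[Dict]:
    """Greedy clustering: a fact joins the first cluster whose
    representative is similar enough, otherwise it starts a new cluster."""
    clusters: List[Dict] = []
    for sample_id, sample in enumerate(samples):
        for fact in split_into_facts(sample):
            for cluster in clusters:
                if jaccard(fact, cluster["rep"]) >= sim_threshold:
                    cluster["sample_ids"].add(sample_id)
                    break
            else:
                clusters.append({"rep": fact, "sample_ids": {sample_id}})
    return clusters


def select_consistent_facts(samples: List[str], theta: int = 2,
                            sim_threshold: float = 0.6) -> List[str]:
    """Keep representatives of clusters supported by at least `theta`
    distinct samples (assumed meaning of cluster strength)."""
    clusters = cluster_facts(samples, sim_threshold)
    return [c["rep"] for c in clusters if len(c["sample_ids"]) >= theta]


def summarize(facts: List[str]) -> str:
    """Placeholder for the LLM summarization step: just joins the facts."""
    return " ".join(facts)


samples = [
    "Paris is the capital of France. The Eiffel Tower opened in 1889.",
    "Paris is the capital of France. The Louvre holds the Mona Lisa.",
    "The Eiffel Tower opened in 1889. Bananas are blue.",
]
answer = summarize(select_consistent_facts(samples, theta=2))
```

With `theta=2`, the two facts that appear in more than one sample survive, while the singleton facts (including the hallucinated one) are dropped. Lowering `theta` to 1 keeps every cluster, which illustrates how a strength cutoff trades recall against precision.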

Abstract

Recent work has aimed to improve LLM generations by filtering out hallucinations, thereby improving the precision of the information in responses. Correctness of a long-form response, however, also depends on the recall of multiple pieces of information relevant to the question. In this paper, we introduce Atomic Self-Consistency (ASC), a technique for improving the recall of relevant information in an LLM response. ASC follows recent work, Universal Self-Consistency (USC), in using multiple stochastic samples from an LLM to improve the long-form response. Unlike USC, which focuses only on selecting the best single generation, ASC picks authentic subparts from the samples and merges them into a superior composite answer. Through extensive experiments and ablations, we show that merging relevant subparts of multiple samples performs significantly better than picking a single sample. ASC demonstrates significant gains over USC on multiple factoid and open-ended QA datasets (ASQA, QAMPARI, QUEST, ELI5) with ChatGPT and Llama2. Our analysis also reveals untapped potential for enhancing long-form generations using the approach of merging multiple samples.

Paper Structure (27 sections, 8 figures, 4 tables)

Figures (8)

  • Figure 1: A$_1$: A precise answer. A$_2$: An answer with higher recall of atomic facts relevant to the question Q.
  • Figure 2: Best possible recall (oracle performance) with increasing number of samples on ASQA (ChatGPT). Merging subparts from multiple samples has a much higher ceiling. ASC beats USC and Direct, and almost matches the ceiling performance of picking one best sample.
  • Figure 3: Overall proposed pipeline. Generated samples are split into smaller parts and clustered. Clusters are then filtered by a consistency-based criterion (higher-strength clusters are selected while lower-strength clusters are removed). Selected cluster representatives are then summarized by an LLM to generate a final answer.
  • Figure 4: ASQA. Increasing $\Theta$ improves QA-F1, reduces Mauve. Adjusting $\Theta$ produces a preferred answer.
  • Figure 5: QAMPARI. Performance starts to stagnate when clusters' entropy stagnates.
  • ...and 3 more figures