Atomic Self-Consistency for Better Long Form Generations
Raghuveer Thirukovalluru, Yukun Huang, Bhuwan Dhingra
TL;DR
This work tackles the recall aspect of long-form QA by introducing Atomic Self-Consistency (ASC), a black-box method that merges authentic atomic facts drawn from multiple stochastic samples to produce a superior composite answer. ASC decomposes candidate generations into atomic facts, clusters them, filters the clusters by a consistency-driven score, and then synthesizes a final answer by summarizing the selected cluster representatives. Across ASQA, QAMPARI, QUEST, and ELI5, ASC consistently outperforms single-sample baselines and USC-based approaches, improving recall without sacrificing precision and often achieving higher fluency. The paper also provides thorough ablations and entropy-based analyses that show how a Theta parameter controls the method. These analyses indicate substantial untapped potential for further gains from integrating ASC with additional verification strategies and optimized sampling budgets.
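The decompose, cluster, and filter steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: each sentence is treated as one atomic fact, facts are clustered greedily by token-level Jaccard similarity, and `theta` is read as the minimum fraction of samples that must support a cluster. All function names and the `sim_threshold` parameter are hypothetical, and the final summarization step (an LLM call) is omitted.

```python
import re


def split_into_facts(sample: str) -> list[str]:
    # Assumption: one sentence approximates one atomic fact.
    parts = re.split(r"(?<=[.!?])\s+", sample.strip())
    return [p.strip() for p in parts if p.strip()]


def tokens(text: str) -> set[str]:
    # Lowercased alphanumeric tokens, used only for the similarity stand-in.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_facts(samples: list[str], sim_threshold: float = 0.5) -> list[dict]:
    # Greedy clustering: a fact joins the first cluster whose representative
    # is similar enough; otherwise it starts a new cluster.
    clusters: list[dict] = []
    for sample_id, sample in enumerate(samples):
        for fact in split_into_facts(sample):
            fact_tokens = tokens(fact)
            for cluster in clusters:
                if jaccard(fact_tokens, cluster["rep_tokens"]) >= sim_threshold:
                    cluster["sample_ids"].add(sample_id)
                    break
            else:
                clusters.append(
                    {"rep": fact, "rep_tokens": fact_tokens, "sample_ids": {sample_id}}
                )
    return clusters


def select_consistent_facts(
    samples: list[str], theta: float = 0.5, sim_threshold: float = 0.5
) -> list[str]:
    # Consistency score (assumed): fraction of samples contributing to a cluster.
    # Clusters scoring below theta are dropped; representatives of the rest
    # would be passed to a summarizer to write the final answer.
    clusters = cluster_facts(samples, sim_threshold)
    n = len(samples)
    return [c["rep"] for c in clusters if len(c["sample_ids"]) / n >= theta]


samples = [
    "Paris is the capital of France. The Eiffel Tower is in Paris.",
    "Paris is the capital of France. France uses the euro.",
    "The capital of France is Paris. The Eiffel Tower is in Paris.",
]
selected = select_consistent_facts(samples, theta=0.5)
```

In this toy run, the fact that appears in only one of the three samples falls below `theta=0.5` and is filtered out, while facts supported by two or three samples are kept. Raising `theta` trades recall for stricter agreement across samples.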
Abstract
Recent work has aimed to improve LLM generations by filtering out hallucinations, thereby improving the precision of the information in responses. Correctness of a long-form response, however, also depends on the recall of multiple pieces of information relevant to the question. In this paper, we introduce Atomic Self-Consistency (ASC), a technique for improving the recall of relevant information in an LLM response. ASC follows recent work, Universal Self-Consistency (USC), in using multiple stochastic samples from an LLM to improve the long-form response. Unlike USC, which focuses only on selecting the best single generation, ASC picks authentic subparts from the samples and merges them into a superior composite answer. Through extensive experiments and ablations, we show that merging relevant subparts of multiple samples performs significantly better than picking a single sample. ASC demonstrates significant gains over USC on multiple factoid and open-ended QA datasets (ASQA, QAMPARI, QUEST, and ELI5) with ChatGPT and Llama2. Our analysis also reveals untapped potential for enhancing long-form generations using the approach of merging multiple samples.
