Toward Faithful and Complete Answer Construction from a Single Document

Zhaoyang Chen; Cody Fleming

TL;DR

This paper addresses the challenge of producing complete and faithful answers from a single document with LLMs in safety-critical settings. It proposes EVE, a structured Extraction–Validation–Enumerate pipeline that replaces unconstrained generation with a verifiable, skeleton-driven process; multi-query extraction and validation reduce omissions and hallucinations exponentially in the number of queries. Theoretical analysis and STPA-based experiments show substantial gains in recall, precision, and F1 over single-pass generation, while also exposing fundamental limits of language-based reasoning and the potential value of combining EVE with formal verification. The work offers a practical, model-agnostic framework that tightens grounding in the source document and can be extended with external symbolic tools to achieve stronger guarantees in high-stakes domains.

Abstract

Modern large language models (LLMs) are powerful generators driven by statistical next-token prediction. While effective at producing fluent text, this design biases models toward high-probability continuations rather than exhaustive and faithful answers grounded in source content. As a result, direct application of LLMs offers no systematic mechanism to ensure both completeness (avoiding omissions) and faithfulness (avoiding unsupported content), which conflicts with core AI safety principles. To address this limitation, we present EVE, a structured framework for document-grounded reasoning. Unlike free-form prompting, EVE constrains generation to a structured, verifiable pipeline that decomposes high-rigor reasoning into extraction, validation, and enumeration. Empirically, this design yields consistent and simultaneous improvements in recall, precision, and F1-score: recall and precision increase by up to 24% and 29%, respectively, with a corresponding 31% gain in F1-score. This breaks the long-standing trade-off between coverage and accuracy typical of single-pass LLM generation, and it also mitigates output truncation caused by length limits. At the same time, we emphasize that EVE's performance saturates because of the inherent ambiguity of natural language, reflecting fundamental limits of language-based reasoning.
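The extraction and validation stages described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the callable-based query interface, and the majority-vote rule are assumptions made for illustration, and the omission estimate assumes that each extraction query misses a given item independently with the same probability.

```python
from typing import Callable, Iterable, Sequence, Set

# Hypothetical interfaces (assumptions, not from the paper):
# an extractor maps a document to a set of candidate items,
# a validator judges whether one item is supported by the document.
Extractor = Callable[[str], Set[str]]
Validator = Callable[[str, str], bool]


def extract_union(extractors: Iterable[Extractor], document: str) -> Set[str]:
    """Pool candidates from several independent extraction queries."""
    candidates: Set[str] = set()
    for extract in extractors:
        candidates |= extract(document)
    return candidates


def validate_by_majority(
    candidates: Set[str], validators: Sequence[Validator], document: str
) -> Set[str]:
    """Keep an item only if a strict majority of validators accept it."""
    kept: Set[str] = set()
    for item in candidates:
        votes = sum(1 for validate in validators if validate(item, document))
        if 2 * votes > len(validators):
            kept.add(item)
    return kept


def omission_probability(p_miss: float, num_queries: int) -> float:
    """Chance that every query misses an item, assuming independent misses."""
    return p_miss ** num_queries
```

Under the independence assumption, an item that a single query misses half the time is missed by all of three queries with probability 0.5 ** 3 = 0.125, which is the sense in which pooling queries reduces omissions exponentially. Pooling also admits more unsupported candidates, which is why a validation pass follows extraction in this sketch.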

Paper Structure (21 sections, 1 equation, 4 figures, 3 tables, 1 algorithm)


Introduction
Related Work
Prompt-based structured reasoning: Chain-of-Thought (CoT), Tree-of-Thought (ToT), and their variants
Retrieval-augmented generation (RAG)
Post-hoc verification and critic models
Program-aided and tool-using reasoning
Decomposition frameworks and agent-style methods
Theoretical Analysis
EVE algorithm explanation
Exponential Error Reduction through Aggregation and Multi-Stage Voting
Robust Enumeration via Multiple Attempts
Overall Completeness Probability
Experiment: STPA as a High-Rigor Reasoning Benchmark
Experimental Protocol
Impact of Multi-Query Extraction
...and 6 more sections

Figures (4)

Figure 1: Comparison between the traditional generation method and EVE
Figure 2: Recall as a function of the number of independent queries
Figure 3: Extraction performance with varying numbers of independent extraction queries (q1 to q4), without validation
Figure 4: Effect of validation intensity (v0 to v4 independent queries) on precision and recall after extraction

Theorems & Definitions (4)

Claim 1: Exponential Omission Reduction of EVE
proof
Claim 2
proof
