Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation

Adam Dejl, James Barry, Alessandra Pascale, Javier Carnerero Cano

TL;DR

This work tackles the problem of detecting omissions in LLM-generated text by introducing three automated comprehensiveness metrics (NLI-based, Q&A-based, and end-to-end), each producing a set of covered and uncovered atomic facts relative to a reference corpus and yielding a comprehensiveness score. The NLI-based approach builds a graph of atomic statements linked by entailment relations; the Q&A-based approach uses question answering and answer comparison to form a similar graph; the end-to-end approach uses an LLM to identify missing content directly, without intermediate steps. On the WikiContradict and ConflictBank benchmarks, the end-to-end method often performs best, though at the cost of robustness, granularity, and interpretability, while the Q&A approach offers strong robustness and more interpretable diagnostics. The authors also evaluate open-weight LLMs on real-world, retrieval-augmented queries from Reddit, finding that gpt-oss-120b provides the strongest overall comprehensiveness. Limitations include sensitivity to background quality, computational cost for fine-grained variants, and reliance on evaluator models, highlighting the need for trustworthy corpora and careful deployment in practice.

Abstract

Despite demonstrating remarkable performance across a wide range of tasks, large language models (LLMs) have also been found to frequently produce outputs that are incomplete or selectively omit key information. In sensitive domains, such omissions can result in significant harm comparable to that posed by factual inaccuracies, including hallucinations. In this study, we address the challenge of evaluating the comprehensiveness of LLM-generated texts, focusing on the detection of missing information or underrepresented viewpoints. We investigate three automated evaluation strategies: (1) an NLI-based method that decomposes texts into atomic statements and uses natural language inference (NLI) to identify missing links, (2) a Q&A-based approach that extracts question-answer pairs and compares responses across sources, and (3) an end-to-end method that directly identifies missing content using LLMs. Our experiments demonstrate the surprising effectiveness of the simple end-to-end approach compared to more complex methods, though at the cost of reduced robustness, interpretability and result granularity. We further assess the comprehensiveness of responses from several popular open-weight LLMs when answering user queries based on multiple sources.
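The shared output shape of the three metrics (a set of covered facts, a set of uncovered facts, and a score) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `ComprehensivenessResult`, `evaluate`, and `naive_judge` are hypothetical, the score is taken to be the fraction of reference facts that are covered (this page does not give the paper's exact formula), and the substring check merely stands in for the NLI, Q&A, or LLM judgment the paper actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComprehensivenessResult:
    """Hypothetical container for the output of a comprehensiveness metric."""
    covered: frozenset
    uncovered: frozenset

    @property
    def score(self) -> float:
        # Assumed definition: share of reference facts found in the response.
        total = len(self.covered) + len(self.uncovered)
        return len(self.covered) / total if total else 1.0


def evaluate(reference_facts, is_covered):
    """Split reference facts into covered/uncovered using a judge callable."""
    covered = frozenset(f for f in reference_facts if is_covered(f))
    uncovered = frozenset(reference_facts) - covered
    return ComprehensivenessResult(covered, uncovered)


def naive_judge(response):
    """Toy stand-in for an NLI, Q&A, or LLM judge: case-insensitive substring match."""
    text = response.lower()
    return lambda fact: fact.lower() in text


facts = ["aspirin reduces fever", "aspirin can cause stomach bleeding"]
response = "Aspirin reduces fever and relieves mild pain."
result = evaluate(facts, naive_judge(response))
print(result.score, sorted(result.uncovered))
```

In this toy case the response is accurate but omits the bleeding risk, so one of the two reference facts ends up in the uncovered set. Returning the uncovered set alongside the score is what makes the omission itself inspectable.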

Paper Structure

This paper contains 39 sections, 14 equations, 7 figures.

Figures (7)

  • Figure 1: Illustration of two scenarios in which comprehensiveness evaluation could reveal issues with the model responses. (a) The model provides a factually precise but incomplete answer, failing to mention more important and relevant facts. (b) Despite conflicting evidence, the LLM only presents a one-sided view without acknowledging the conflict.
  • Figure 2: Overview of the three comprehensiveness metrics introduced in this work.
  • Figure 3: Results of comprehensiveness meta-evaluation on the WikiContradict dataset for different models and metric variants. The error bars indicate 95% confidence intervals determined using BCa bootstrap.
  • Figure 4: Results of comprehensiveness meta-evaluation on the ConflictBank dataset for different models and metric variants. The error bars indicate 95% confidence intervals determined using BCa bootstrap.
  • Figure 5: Averaged results of comprehensiveness meta-evaluation on both datasets for different models and metric variants. The error bars indicate 95% confidence intervals determined using BCa bootstrap.
  • ...and 2 more figures
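The captions above report 95% confidence intervals from a BCa (bias-corrected and accelerated) bootstrap. As a rough sketch of how such an interval can be computed for a list of per-example scores, the following standard-library Python follows the textbook BCa recipe. It is not the authors' code: the function name `bca_interval`, the resample count, the seed, and the example scores are all assumptions chosen for illustration.

```python
import random
from statistics import NormalDist, mean


def bca_interval(data, stat=mean, n_boot=2000, alpha=0.05, seed=0):
    """Bias-corrected and accelerated bootstrap interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    theta = stat(data)
    # Bootstrap distribution of the statistic (resampling with replacement).
    boots = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    norm = NormalDist()

    # Bias correction z0: how far the bootstrap median sits from the estimate.
    below = sum(b < theta for b in boots)
    prop = min(max(below / n_boot, 1 / (n_boot + 1)), n_boot / (n_boot + 1))
    z0 = norm.inv_cdf(prop)

    # Acceleration a: skewness of the leave-one-out (jackknife) estimates.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jmean = mean(jack)
    num = sum((jmean - j) ** 3 for j in jack)
    den = 6 * sum((jmean - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0

    def adjusted(q):
        # Map a nominal quantile to its BCa-adjusted quantile.
        z = norm.inv_cdf(q)
        return norm.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    def pick(q):
        # Clamp the index so extreme adjusted quantiles stay in range.
        return boots[min(n_boot - 1, max(0, int(q * n_boot)))]

    return pick(adjusted(alpha / 2)), pick(adjusted(1 - alpha / 2))


# Illustrative per-example scores (made up, not taken from the paper).
scores = [0.2, 0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.3, 0.6, 0.75]
low, high = bca_interval(scores)
print(f"mean={mean(scores):.3f}, 95% BCa CI=({low:.3f}, {high:.3f})")
```

Compared with a plain percentile bootstrap, the two correction terms shift the interval endpoints to compensate for bias and skew in the bootstrap distribution, which matters for bounded scores that cluster near 0 or 1.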