Evaluating Factuality in Generation with Dependency-level Entailment

Tanya Goyal, Greg Durrett

TL;DR

This work tackles factuality in text generation by shifting from sentence-level entailment to dependency-arc entailment (DAE), enabling per-arc factual judgments and online enforcement. It builds an automatically labeled training signal from paraphrase data to supervise arc-level entailment without costly human annotation. Empirically, DAE outperforms sentence-level entailment and question-generation approaches in both summarization and paraphrase filtering tasks, while also localizing the specific arcs responsible for factual errors. The approach offers a practical, interpretable method for improving generation fidelity and diagnosing factual failures.

Abstract

Despite significant progress in text generation models, a serious limitation is their tendency to produce text that is factually inconsistent with information in the input. Recent work has studied whether textual entailment systems can be used to identify factual errors; however, these sentence-level entailment models are trained to solve a different problem than generation filtering and they do not localize which part of a generation is non-factual. In this paper, we propose a new formulation of entailment that decomposes it at the level of dependency arcs. Rather than focusing on aggregate decisions, we instead ask whether the semantic relationship manifested by individual dependency arcs in the generated output is supported by the input. Human judgments on this task are difficult to obtain; we therefore propose a method to automatically create data based on existing entailment or paraphrase corpora. Experiments show that our dependency arc entailment model trained on this data can identify factual inconsistencies in paraphrasing and summarization better than sentence-level methods or those based on question generation, while additionally localizing the erroneous parts of the generation.

Paper Structure

This paper contains 29 sections, 3 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Overview of our dependency arc entailment formulation using a filtered set of Stanford Enhanced Dependencies. The DAE model makes independent factuality decisions for each dependency arc from the two generated hypotheses.
  • Figure 2: Overview of our dependency arc entailment model. The input (premise) sentence and output (or prefix of the output) are encoded with a pre-trained model. The embeddings of the head and tail of an arc are selected, concatenated with an encoding of the dependency label, and fed to a classification layer to render the judgment.
  • Figure 3: Arc annotations from the automatic labelling strategy of Section \ref{sec:data}. Green (+) arcs are labelled entailed, red (-) arcs are non-entailed, and the gray arcs are unannotated.
  • Figure 4: Performance of the Electra-based MNLI model and the DAE model. The figure shows a much higher variance in reranking accuracy for the MNLI model, suggesting that the task-specific performance is not correlated with reranking performance.
  • Figure 5: Individual arc entailment probabilities for arcs in output sentences from the summarization test set (Falke et al., 2019) and the paraphrase test set. The $+/-$ superscript signifies the gold label for that arc. Our DAE model is able to localize errors in the output. In contrast, the MNLI model computes a high entailment score for all arcs that are lexically similar.
  • ...and 1 more figure
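
The arc-level classifier described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the random vectors stand in for the output of a pre-trained encoder, the names `arc_features` and `arc_entailment_prob` and the dimensions are hypothetical, and a single linear layer with a sigmoid is assumed as the classification layer. It only shows the step the caption names: select the head and tail embeddings of an arc, concatenate them with an encoding of the dependency label, and score the result.

```python
import math
import random


def arc_features(token_embeddings, head_idx, child_idx, label_embedding):
    """Concatenate head-token, tail-token, and dependency-label vectors."""
    return token_embeddings[head_idx] + token_embeddings[child_idx] + label_embedding


def arc_entailment_prob(features, weights, bias):
    """Assumed classification layer: one linear map followed by a sigmoid."""
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy stand-ins (hypothetical): a real system would take these vectors from a
# pre-trained encoder run over the premise and the generated output.
rng = random.Random(0)
hidden, label_dim = 4, 2
tokens = ["the", "cat", "sat"]
token_embeddings = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in tokens]
label_embeddings = {
    "nsubj": [rng.uniform(-1, 1) for _ in range(label_dim)],
    "det": [rng.uniform(-1, 1) for _ in range(label_dim)],
}

# Arcs as (head index, tail index, dependency label) over the output tokens.
arcs = [(2, 1, "nsubj"), (1, 0, "det")]

# Untrained, randomly initialised classifier parameters.
weights = [rng.uniform(-1, 1) for _ in range(2 * hidden + label_dim)]
bias = 0.0

# Each arc is scored independently of the others.
probs = {}
for head, child, label in arcs:
    feats = arc_features(token_embeddings, head, child, label_embeddings[label])
    probs[(head, child, label)] = arc_entailment_prob(feats, weights, bias)

for arc, p in probs.items():
    print(arc, round(p, 3))
```

Because every arc receives its own score, low-probability arcs point to the part of the output that is likely unsupported, which is the localization property the Figure 5 caption refers to. With untrained weights the printed values are meaningless; they only demonstrate the data flow.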