Think While You Write: Hypothesis Verification Promotes Faithful Knowledge-to-Text Generation

Yifu Qiu, Varun Embar, Shay B. Cohen, Benjamin Han

TL;DR

Hallucinations in knowledge-to-text generation undermine factual accuracy. The paper introduces TWEAK, a decoding-time strategy that augments standard decoding with hypothesis verification, so that candidates are judged for faithfulness without retraining the generator. It evaluates two Hypothesis Verification Model (HVM) variants: an off-the-shelf NLI-based verifier and a task-specific HVM trained on the FATE dataset. The results show significant faithfulness gains with minimal quality loss on WebNLG, and improved out-of-distribution (OOD) robustness with the NLI variant. In distribution, the FATE-based HVM often outperforms the NLI verifier in both faithfulness and quality; in OOD settings the NLI verifier sometimes generalizes better on faithfulness, while the FATE-based HVM retains stronger quality. Overall, TWEAK provides a practical, plug-in approach to reducing hallucinations in knowledge-to-text pipelines and highlights the value of task-specific verification signals.

Abstract

Knowledge-to-text generators often struggle to faithfully generate descriptions for the input facts: they may produce hallucinations that contradict the input, or describe facts not present in the input. To reduce hallucinations, we propose a decoding-only method, TWEAK (Think While Effectively Articulating Knowledge), which can be integrated with any generator without retraining. TWEAK treats the sequence generated at each decoding step and its future sequence as hypotheses, and ranks each generation candidate by the extent to which its hypotheses are supported by the input facts, as judged by a Hypothesis Verification Model (HVM). We first demonstrate the effectiveness of TWEAK by using a Natural Language Inference (NLI) model as the HVM and report improved faithfulness with minimal impact on quality. We then replace the NLI model with a task-specific HVM trained on a first-of-a-kind dataset, FATE (Fact-Aligned Textual Entailment), which pairs input facts with their original and perturbed descriptions. We test TWEAK with two generators; averaged over the two models, the best TWEAK variants improve faithfulness (FactKB) by 2.24/7.17 points in in-/out-of-distribution evaluations, respectively, with only a 0.14/0.32-point decline in quality (BERTScore).
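
The ranking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the additive combination of log-likelihood and faithfulness, the weight `alpha` (named after the $\alpha$ mentioned in the Figure 3 caption), the equal weighting of backward and forward hypotheses via `backward_weight`, and the token-overlap `toy_hvm` verifier are all assumptions made for illustration. A real HVM would be an NLI model or the FATE-trained verifier, and the facts and candidates below are invented toy data.

```python
from typing import Callable, List, Sequence, Tuple

Triple = Tuple[str, str, str]
# (backward hypothesis: text decoded so far,
#  forward hypothesis: a look-ahead continuation,
#  generator log-likelihood of the candidate)
Candidate = Tuple[str, str, float]
Verifier = Callable[[Sequence[Triple], str], float]


def toy_hvm(facts: Sequence[Triple], hypothesis: str) -> float:
    """Hypothetical stand-in for an HVM: the fraction of hypothesis
    tokens that also occur somewhere in the input triples."""
    fact_tokens = {
        tok.lower() for triple in facts for part in triple for tok in part.split()
    }
    tokens = [tok.lower().strip(".,") for tok in hypothesis.split()]
    if not tokens:
        return 1.0
    return sum(tok in fact_tokens for tok in tokens) / len(tokens)


def tweak_score(
    facts: Sequence[Triple],
    candidate: Candidate,
    hvm: Verifier,
    alpha: float = 1.0,
    backward_weight: float = 0.5,
) -> float:
    """Assumed scoring form: log-likelihood plus alpha times a weighted
    mix of backward and forward faithfulness. alpha = 0 leaves only the
    generator's likelihood, i.e. plain beam-search ranking."""
    backward, forward, log_prob = candidate
    faithfulness = (
        backward_weight * hvm(facts, backward)
        + (1.0 - backward_weight) * hvm(facts, forward)
    )
    return log_prob + alpha * faithfulness


def rerank(
    facts: Sequence[Triple],
    candidates: Sequence[Candidate],
    hvm: Verifier,
    alpha: float = 1.0,
) -> List[Candidate]:
    """Order the candidates of one decoding step, best first."""
    return sorted(
        candidates,
        key=lambda cand: tweak_score(facts, cand, hvm, alpha),
        reverse=True,
    )


# Invented toy input: one fact and two competing candidates.
facts = [("Alan Bean", "birthPlace", "Wheeler Texas")]
candidates = [
    ("Alan Bean was born in the United", "States", -1.0),  # more likely, unsupported
    ("Alan Bean was born in Wheeler", "Texas", -1.2),      # less likely, supported
]

likelihood_only = rerank(facts, candidates, toy_hvm, alpha=0.0)
with_verification = rerank(facts, candidates, toy_hvm, alpha=1.0)
```

With `alpha=0.0` the more likely but unsupported candidate stays on top; with `alpha=1.0` the verifier signal promotes the candidate whose hypotheses overlap with the input fact. In an actual decoder this re-ranking would run inside the beam at each step, which the sketch leaves out.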

Paper Structure (26 sections, 6 equations, 9 figures, 9 tables)


Figures (9)

  • Figure 1: Our proposed TWEAK approach. Compared with beam search, which ranks candidates solely by the generative model's predicted likelihood, TWEAK incorporates faithfulness, which is estimated by evaluating the backward and forward hypotheses of each generation candidate with a Hypothesis Verification Model (HVM). In the 4th decoding step of this example, beam search promotes the candidate leading to hallucinations (e.g., "United States"), but TWEAK demotes it using signals from the HVM.
  • Figure 2: Our task-specific hypothesis verification model. It takes fact triples and backward/forward hypotheses as input, and predicts pair-wise faithfulness relations for each triple-hypothesis pair in a 2D table.
  • Figure 3: The effect on quality (BLEU) and faithfulness (FactKB) from choosing different $\alpha$ in Eq. \ref{equ:tweak_scoring}, with $\alpha = 0$ being equivalent to beam search. The results are obtained using the TWEAK-NLI-B+F and TWEAK-HVM variants on the WebNLG test set with BART.
  • Figure 4: Performance differences ($\Delta$) on quality (BLEU) and faithfulness (FactKB) between TWEAK-HVM, TWEAK-NLI-B+F and beam search on various beam sizes $\{2,4,6,8,10,15\}$. All experiments are done on WebNLG with BART-large.
  • Figure 5: The distributions of the relative positions where negative predictions (i.e., possible hallucinations) occur during the decoding process. $0$ and $1$ along the horizontal axis represent the start and end of decoding. The upper and lower panels show TWEAK-HVM and TWEAK-NLI-B+F, respectively, running on WebNLG with BART-large.
  • ...and 4 more figures