Does the Generator Mind its Contexts? An Analysis of Generative Model Faithfulness under Context Transfer

Xinshuo Hu; Baotian Hu; Dongfang Li; Xiaoguang Li; Lifeng Shang

TL;DR

The paper investigates whether generative models stay grounded in their input when the contextual knowledge changes, introducing a knowledge-augmented generator and a Margin Failure Rate ($\mathcal{MFR}$) metric to quantify faithfulness under context transfer. It finds that memory hallucination occurs across multiple architectures (e.g., FiD, BART, T5) and is influenced by factors such as the scale of contextual knowledge and the presence of noisy or negative contexts. Using a Debatepedia-based long-form QA benchmark and BERT-Score-driven margin checks, the study highlights how hard grounding is under dynamic knowledge and argues for robust evaluation and mitigation strategies. The work provides a framework and dataset for systematically studying context-driven hallucinations and points to future directions for improving faithfulness in practical, knowledge-enabled NLP systems.

Abstract

The present study introduces the knowledge-augmented generator, which is specifically designed to produce information that remains grounded in contextual knowledge, regardless of alterations in the context. Previous research has predominantly focused on examining hallucinations stemming from static input, such as in the domains of summarization or machine translation. However, our investigation delves into the faithfulness of generative question answering in the presence of dynamic knowledge. Our objective is to explore the existence of hallucinations arising from parametric memory when contextual knowledge undergoes changes, while also analyzing the underlying causes for their occurrence. In order to efficiently address this issue, we propose a straightforward yet effective measure for detecting such hallucinations. Intriguingly, our investigation uncovers that all models exhibit a tendency to generate previous answers as hallucinations. To gain deeper insights into the underlying causes of this phenomenon, we conduct a series of experiments that verify the critical role played by context in hallucination, both during training and testing, from various perspectives.
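The abstract does not spell out the measure, so the following is only a minimal sketch of what a margin-based failure rate could look like. It assumes (this is not confirmed by the text above) that a prediction counts as a failure when it is more similar to the old, memorized answer than to the new, context-supported answer by more than a margin. The function names `token_f1` and `margin_failure_rate` are hypothetical, and a simple token-overlap F1 stands in for BERT-Score so the example has no dependencies.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1; a dependency-free stand-in for BERT-Score (assumption)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def margin_failure_rate(predictions, new_answers, old_answers,
                        margin=0.0, score=token_f1):
    """Fraction of predictions closer to the old answer than to the new one
    by more than `margin`. This failure criterion is an assumed definition."""
    if not (len(predictions) == len(new_answers) == len(old_answers)):
        raise ValueError("all three lists must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    failures = 0
    for pred, new, old in zip(predictions, new_answers, old_answers):
        if score(pred, old) - score(pred, new) > margin:
            failures += 1
    return failures / len(predictions)


if __name__ == "__main__":
    preds = ["paris is the capital", "berlin is the capital"]
    new = ["berlin is the capital", "berlin is the capital"]
    old = ["paris is the capital", "paris is the capital"]
    print(margin_failure_rate(preds, new, old, margin=0.0))
```

Swapping `score` for a real BERT-Score function would bring the sketch closer to the setting the paper describes; the margin then controls how decisively a prediction must favor the old answer before it is counted as a memory hallucination.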

Paper Structure (20 sections, 2 equations, 5 figures, 3 tables)

Introduction
Related Work
Faithful Natural Language Generation
Context Transfer
Methods
Task: Question Answering under Context Transfer
Measure: Margin Failure Rate
Results
All models have memory hallucination under context transfer.
Analysis
Impact of Contextual Knowledge Scale
Impact of Irrelevant Noisy Context
Conclusion
Benchmark Dataset
Evaluation Metrics
...and 5 more sections

Figures (5)

Figure 1: An example of generated hallucination from training memory. The model disregards the transferred contextual knowledge and predicts an out-of-date answer that was present in its original training data when answering the same question. Non-essential details are omitted and marked with [...].
Figure 2: The influence of the scale of contextual knowledge and training step on $\mathcal{MFR}$(BERT-Score).
Figure 3: The $\mathcal{MFR}$(BERT-Score) results over different settings of contexts.
Figure 4: The Pearson correlation of margin failure ratio from basic metrics with different margins.
Figure 5: The Pearson correlation of margin failure ratio from each metric and human evaluation.
