Rich Knowledge Sources Bring Complex Knowledge Conflicts: Recalibrating Models to Reflect Conflicting Evidence

Hung-Ting Chen; Michael J. Q. Zhang; Eunsol Choi

TL;DR

This work investigates how open-retrieval QA models blend parametric knowledge with a large pool of retrieved evidence when the sources conflict. By simulating knowledge conflicts with up to $N=100$ passages and applying perturbations (entity substitutions and semantic changes), the authors show that models rely largely on retrieved passages and that parametric knowledge mainly serves to break ties rather than to produce novel answers. They find that model confidence does not reliably reflect knowledge conflicts, which motivates a calibration-based approach in which the model abstains when multiple plausible answers exist; calibrated abstention yields modest improvements but faces generalization challenges. Overall, the study highlights the need for better aggregation across multiple conflicting sources and provides a framework for calibrating models to avoid overconfident single-answer predictions in the presence of conflicting evidence.

Abstract

Question answering models can use rich knowledge sources -- up to one hundred retrieved passages and parametric knowledge in the large-scale language model (LM). Prior work assumes information in such knowledge sources is consistent with each other, paying little attention to how models blend information stored in their LM parameters with that from retrieved evidence documents. In this paper, we simulate knowledge conflicts (i.e., where parametric knowledge suggests one answer and different passages suggest different answers) and examine model behaviors. We find retrieval performance heavily impacts which sources models rely on, and current models mostly rely on non-parametric knowledge in their best-performing settings. We discover a troubling trend that contradictions among knowledge sources affect model confidence only marginally. To address this issue, we present a new calibration study, where models are discouraged from presenting any single answer when presented with multiple conflicting answer candidates in retrieved evidences.
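To make the simulated conflict concrete, the following is a minimal sketch of an entity-substitution perturbation: the gold answer is replaced by a substitute entity in a chosen proportion of the passages that contain it, so the evidence set ends up supporting more than one answer. It is an illustration under stated assumptions, not the authors' implementation: the function names `substitute_answers` and `answer_candidates`, the exact string matching, and the toy passages are all hypothetical.

```python
import random


def substitute_answers(passages, gold_answer, substitute, proportion, seed=0):
    """Replace gold_answer with substitute in a proportion of the passages
    that contain it. Assumes exact, case-sensitive string matching."""
    rng = random.Random(seed)
    # Indices of passages that mention the gold answer span.
    hits = [i for i, p in enumerate(passages) if gold_answer in p]
    # Number of answer-bearing passages to perturb.
    k = round(proportion * len(hits))
    chosen = set(rng.sample(hits, k))
    return [
        p.replace(gold_answer, substitute) if i in chosen else p
        for i, p in enumerate(passages)
    ]


def answer_candidates(passages, candidates):
    """Return the candidate answers mentioned in at least one passage."""
    return [c for c in candidates if any(c in p for p in passages)]


# Toy passages, invented for illustration only.
passages = [
    "The event took place in Norway.",
    "Reports place the event in Norway.",
    "Norway hosted the event.",
    "The weather that year was unusually cold.",
]
perturbed = substitute_answers(passages, "Norway", "Germany", proportion=0.5)

# The evidence is conflicting when more than one candidate is supported.
conflict = len(answer_candidates(perturbed, ["Norway", "Germany"])) > 1
```

When `conflict` is true, the passages back more than one answer, which is the situation in which the paper argues a model should abstain rather than commit to a single prediction.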

Paper Structure (38 sections, 1 equation, 3 figures, 18 tables)

Introduction
Background
Model
Fusion-in-Decoder (FiD)
Retrieval Augmented Generation (RAG)
Model Confidence Study
When do retrieval-based generation models rely on parametric knowledge?
Revisiting knowledge conflict study in Longpre et al. (2021)
Takeaway
Simulating Mixed Bag of Evidence Passages
Entity Substitution
Setting.
Results.
Confidence Study.
Additional Analysis.
...and 23 more sections

Figures (3)

Figure 1: Models can use both parametric and non-parametric knowledge sources. In this example, the answer could be the U.S., Norway, or Germany. We investigate, for a given question, which knowledge source was most influential in producing the answer. The model should be able to abstain from answering such examples, as it is difficult for the model to decide which answer candidate is correct.
Figure 2: Substituting different proportions of the retrieved passages that contain gold answer spans, on the filtered NQ-Open (top) and Trivia QA (bottom) development sets.
Figure 3: The ratio of the calibration score after perturbation to that before perturbation, in log scale. The occurrences of examples with different ratios are plotted as probability densities (the area under each curve sums to 1). The distributions are bell-shaped but shifted slightly toward the negative x-axis.
