Rich Knowledge Sources Bring Complex Knowledge Conflicts: Recalibrating Models to Reflect Conflicting Evidence
Hung-Ting Chen, Michael J. Q. Zhang, Eunsol Choi
TL;DR
This work investigates how open-retrieval QA models blend parametric knowledge with extensive retrieved evidence when the two conflict. By simulating knowledge conflicts with up to $N=100$ passages and applying perturbations (entity substitutions and semantic changes), the authors show that models rely largely on retrieved passages and that parametric knowledge mainly serves to break ties rather than to produce novel answers. They find that model confidence does not reliably reflect knowledge conflicts, which motivates a calibration-based approach that abstains when multiple plausible answers exist; calibrated abstention yields modest improvements but still faces generalization challenges. Overall, the study highlights the need for better aggregation across multiple conflicting sources and provides a framework for calibrating models to avoid overconfident single-answer predictions in the presence of conflicting evidence.
Abstract
Question answering models can use rich knowledge sources -- up to one hundred retrieved passages and parametric knowledge in the large-scale language model (LM). Prior work assumes that the information in such knowledge sources is mutually consistent, paying little attention to how models blend information stored in their LM parameters with that from retrieved evidence documents. In this paper, we simulate knowledge conflicts (i.e., where parametric knowledge suggests one answer and different passages suggest different answers) and examine model behaviors. We find that retrieval performance heavily impacts which sources models rely on, and that current models mostly rely on non-parametric knowledge in their best-performing settings. We discover a troubling trend: contradictions among knowledge sources affect model confidence only marginally. To address this issue, we present a new calibration study, where models are discouraged from presenting any single answer when presented with multiple conflicting answer candidates in the retrieved evidence.
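
As a rough illustration of what simulating a knowledge conflict among retrieved passages could look like, the sketch below substitutes an answer entity in a chosen number of passages and then checks whether more than one candidate answer is supported. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names `substitute_entity` and `has_conflict`, the plain string matching, and the toy passages are all hypothetical.

```python
from typing import List


def substitute_entity(
    passages: List[str], original: str, substitute: str, num_perturbed: int
) -> List[str]:
    """Replace `original` with `substitute` in the first `num_perturbed`
    passages that mention it (hypothetical helper, simple string matching)."""
    perturbed = []
    remaining = num_perturbed
    for passage in passages:
        if remaining > 0 and original in passage:
            perturbed.append(passage.replace(original, substitute))
            remaining -= 1
        else:
            perturbed.append(passage)
    return perturbed


def has_conflict(passages: List[str], candidates: List[str]) -> bool:
    """Return True when more than one candidate answer appears in the passages
    (an assumed, simplified notion of conflicting evidence)."""
    supported = {c for c in candidates if any(c in p for p in passages)}
    return len(supported) > 1


# Toy example: perturb one of the two passages that mention the answer.
passages = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "France borders Spain.",
]
mixed = substitute_entity(passages, "Paris", "Lyon", num_perturbed=1)
conflict = has_conflict(mixed, ["Paris", "Lyon"])
```

In this toy setting, a calibrated model would ideally lower its confidence or abstain on `mixed`, where the passages support two different answers, while answering normally on the unperturbed `passages`.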
