Open Domain Question Answering with Conflicting Contexts

Siyi Liu, Qiang Ning, Kishaloy Halder, Wei Xiao, Zheng Qi, Phu Mon Htut, Yi Zhang, Neha Anna John, Bonan Min, Yassine Benajiba, Dan Roth

TL;DR

Open-domain QA systems rely on retrieved web contexts, which frequently contain conflicting information that can mislead answer generation. The authors introduce QACC, a human-annotated dataset showing that about $25\%$ of unambiguous questions yield conflicting contexts, and benchmark three LLMs to reveal their brittleness under such conflicts. They show that finetuning LLMs on human explanations improves reasoning and QA performance, with transfer to a perturbed NQ-Open dataset, suggesting explanations as a practical cue for handling conflicts. This work provides a dataset, prompts, and methodological insights to guide the development of conflict-aware open-domain QA systems and to reduce the risk of incorrect answers in real-world deployments.

Abstract

Open domain question answering systems frequently rely on information retrieved from large collections of text (such as the Web) to answer questions. However, such collections of text often contain conflicting information, and indiscriminately depending on this information may result in untruthful and inaccurate answers. To understand the gravity of this problem, we collect a human-annotated dataset, Question Answering with Conflicting Contexts (QACC), and find that as many as 25% of unambiguous, open domain questions can lead to conflicting contexts when retrieved using Google Search. We evaluate and benchmark three powerful Large Language Models (LLMs) with our dataset QACC and demonstrate their limitations in effectively addressing questions with conflicting information. To explore how humans reason through conflicting contexts, we ask our annotators to provide explanations for their selections of correct answers. We demonstrate that by finetuning LLMs to explain their answers, we can introduce richer information into their training that guides them through the process of reasoning with conflicting contexts.

Paper Structure

This paper contains 33 sections, 15 figures, 6 tables.

Figures (15)

  • Figure 1: Google search results when querying the question "When did Kendrick Lamars first album come out?". Google suggests different answers (July 2, 2011 / June 17, 2003 / end of the 2010s), making it difficult for a language model to decide which one to believe.
  • Figure 2: Data Collection Pipeline. The left side shows an example input given to the annotators, and the right side shows an example annotated result. During the annotation process, we ask the annotators to identify different possible answers given each context, and decide there is a conflict if there is more than one possible answer. In addition, we ask the annotators to select from a pre-defined list of reasons, and provide a natural language explanation of their decision. In this example, the annotators believe there are three possible answers, and they think 0.615 earth years is the correct answer because it's validated by most trustworthy sources.
  • Figure 3: Reasons annotators gave for selecting one correct answer over the others when there are conflicts. "Majority" means the answer is supported by the most contexts. "Source" means the annotator trusts the contexts more because they come from trustworthy sources. "Common Sense" means the answer matches their own memory and common sense. "Time" means they think one answer is correct since it's the most up-to-date.
  • Figure 4: Different types of questions in our dataset that have conflicts.
  • Figure 5: Few-shot GPT-4o performance on the QACC test questions that have conflicting contexts. The x-axis indicates the different types of questions and the y-axis denotes the F1 score for each type.
  • ...and 10 more figures