ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence

Kevin Wu, Eric Wu, James Zou

TL;DR

This work analyzes how large language models arbitrate between their internal priors and retrieved external content when the two conflict. It introduces ClashEval, a 1,294-question benchmark spanning six domains with systematically perturbed context, and benchmarks six top-performing LLMs to quantify context bias and prior bias. The study finds that models frequently let incorrect context override correct priors, but that context adoption falls as the perturbation deviates further from the truth (i.e., as it becomes less realistic), and that lower model confidence, measured via token probabilities, goes with greater reliance on context. A simple calibration-based approach, including Calibrated Token Probability Correction, substantially improves accuracy and reduces context bias, suggesting a practical path toward more reliable RAG systems. Overall, ClashEval provides a tunable framework for diagnosing and mitigating failures where external evidence conflicts with model priors, with implications for safer deployment of retrieval-augmented LLMs.

Abstract

Retrieval augmented generation (RAG) is frequently used to mitigate hallucinations and provide up-to-date knowledge for large language models (LLMs). However, document retrieval is an imprecise task and sometimes results in erroneous or even harmful content being presented in context, which raises the question of how LLMs handle retrieved information: if the provided content is incorrect, does the model know to ignore it, or does it recapitulate the error? Conversely, when the model's initial response is incorrect, does it always know to use the retrieved information to correct itself, or does it insist on its wrong prior response? To answer this, we curate a dataset of over 1,200 questions across six domains (e.g., drug dosages, Olympic records, locations) along with content relevant to answering each question. We further apply precise perturbations to the answers in the content that range from subtle to blatant errors. We benchmark six top-performing LLMs, including GPT-4o, on this dataset and find that LLMs are susceptible to adopting incorrect retrieved content, overriding their own correct prior knowledge over 60% of the time. However, the more unrealistic the retrieved content is (i.e., the more it deviates from the truth), the less likely the model is to adopt it. Also, the less confident a model is in its initial response (as measured by token probabilities), the more likely it is to adopt the information in the retrieved content. We exploit this finding and demonstrate simple methods for improving model accuracy where there is conflicting retrieved content. Our results highlight a difficult task and benchmark for LLMs -- namely, their ability to correctly discern when they are wrong in light of correct retrieved content and to reject the provided content when it is incorrect.
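
The two failure modes described in the abstract can be made concrete with a small sketch. The conditional definitions below are one reading of the abstract (context bias: the prior was correct, the context was wrong, and the model followed the context; prior bias: the prior was wrong, the context was correct, and the model kept its prior). The `Trial` record, the `bias_metrics` helper, and the exact-string answer matching are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One question answered twice (hypothetical record layout)."""
    prior_answer: str    # model's answer without any document
    context_answer: str  # answer stated in the (possibly perturbed) document
    response: str        # model's answer with the document in context
    truth: str           # reference answer


def rate(hits: int, total: int) -> float:
    """Proportion of hits, or NaN when there are no eligible trials."""
    return hits / total if total else float("nan")


def bias_metrics(trials: list[Trial]) -> dict[str, float]:
    # Eligible cases for context bias: correct prior, incorrect context.
    ctx_cases = [t for t in trials
                 if t.prior_answer == t.truth and t.context_answer != t.truth]
    # Eligible cases for prior bias: incorrect prior, correct context.
    prior_cases = [t for t in trials
                   if t.prior_answer != t.truth and t.context_answer == t.truth]
    return {
        "accuracy": rate(sum(t.response == t.truth for t in trials),
                         len(trials)),
        "context_bias": rate(sum(t.response == t.context_answer
                                 for t in ctx_cases), len(ctx_cases)),
        "prior_bias": rate(sum(t.response == t.prior_answer
                               for t in prior_cases), len(prior_cases)),
    }
```

Exact string comparison is the simplest possible matcher; a real evaluation would likely need answer normalization (units, number formats) before comparing.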

Paper Structure (19 sections, 6 figures, 7 tables)

This paper contains 19 sections, 6 figures, 7 tables.

Figures (6)

  • Figure 1: A schematic of generating modified documents for each dataset. A question is posed to the LLM with and without a reference document containing information relevant to the query. This document is then perturbed to contain modified information and given as context to the LLM. We then observe whether the LLM prefers the modified information or its own prior answer.
  • Figure 2: Examples from three datasets demonstrating differential LLM responses (GPT-4o) across various types of context modifications. Responses in red indicate wrong responses (different from the reference answer); responses in green indicate correct responses.
  • Figure 3: We observe an inverse relationship between the context preference rate (y-axis) and the amount of deviation from the prior (x-axis). Each plot visualizes absolute deviation from the reference information against context preference rate: for numerical datasets, up to two log-fold changes, along with the trendline; for "Years", the absolute number of years; for categorical datasets, a total of four modification categories.
  • Figure 4: We additionally observe an inverse relationship between the context preference rate (y-axis) and the model's prior response probability (x-axis). Context preference rate is defined as the proportion of responses that align with the information presented in the prompt as context. The model's prior response probability is computed from the average log probability of the response tokens queried without context. Each plot visualizes the prior probability (grouped into 10 bins) against the context preference rate, along with the best-fit trend line and slope. Models that allow access to token probabilities are shown.
  • Figure 5: We plot the data from Table \ref{tab:bias} -- each model's performance across three metrics in different colors, along with 95% confidence intervals.
  • ...and 1 more figures
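
The quantities in the Figure 4 caption can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the prior response probability is the exponential of the mean token log probability, that the 10 bins are equal-width over [0, 1], and that the trend line is an ordinary least-squares fit over the per-bin rates. The function names are hypothetical.

```python
import math


def prior_probability(token_logprobs: list[float]) -> float:
    """Map the average token log probability of a response to [0, 1].

    Assumption: the average log probability is exponentiated.
    """
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def binned_preference(records, n_bins: int = 10):
    """Context preference rate per prior-probability bin.

    records: iterable of (prior_prob, followed_context) pairs, where
    followed_context is True when the response matched the context.
    Returns (bin_center, rate) for every non-empty bin.
    """
    counts = [0] * n_bins
    hits = [0] * n_bins
    for prob, followed in records:
        # Clamp so that prob == 1.0 falls into the last bin.
        idx = min(int(prob * n_bins), n_bins - 1)
        counts[idx] += 1
        hits[idx] += int(followed)
    return [((i + 0.5) / n_bins, hits[i] / counts[i])
            for i in range(n_bins) if counts[i]]


def ols_slope(points) -> float:
    """Slope of the least-squares line through (x, y) points."""
    xs, ys = zip(*points)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx
```

A negative slope from `ols_slope(binned_preference(records))` would correspond to the inverse relationship the caption describes: the more probable the model's own prior answer, the less often it follows the context.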