Contradiction Detection in RAG Systems: Evaluating LLMs as Context Validators for Improved Information Consistency

Vignesh Gokul, Srikanth Tenneti, Alwarappan Nakkiran

TL;DR

This work addresses contradictions among the documents retrieved in RAG systems. It introduces a synthetic data-generation framework for three contradiction types (self-, pair-, and conditional contradictions) and evaluates LLMs as context validators on three tasks: conflict detection, conflict-type classification, and conflict segmentation. Using HotpotQA as the data source and Claude-3 Sonnet for generation, the study finds that contradiction detection remains challenging and that performance is highly sensitive to model architecture and prompting strategy: chain-of-thought prompts help Claude models but often hinder Llama models. The findings underscore the importance of model choice, prompt design, and task formulation in building robust RAG systems, and they point to future work on data quality control, additional conflict types, and the resolution and presentation of detected contradictions.

Abstract

Retrieval-Augmented Generation (RAG) systems have emerged as a powerful method for enhancing large language models (LLMs) with up-to-date information. However, the retrieval step in RAG can surface documents containing contradictory information, particularly in rapidly evolving domains such as news. These contradictions can significantly degrade the performance of LLMs, leading to inconsistent or erroneous outputs. This study addresses this challenge in two ways. First, we present a novel data generation framework to simulate different types of contradictions that may occur in the retrieval stage of a RAG system. Second, we evaluate the robustness of different LLMs acting as context validators, assessing their ability to detect contradictory information within retrieved document sets. Our experimental results reveal that context validation remains a challenging task even for state-of-the-art LLMs, with performance varying significantly across different types of contradictions. While larger models generally perform better at contradiction detection, the effectiveness of different prompting strategies varies across tasks and model architectures. We find that chain-of-thought prompting yields notable improvements for some models but may hinder performance in others, highlighting the complexity of the task and the need for more robust approaches to context validation in RAG systems.
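
To make the idea of a context validator concrete, the following is a minimal sketch of how a conflict-detection evaluation could be wired up. It is not the authors' implementation: the prompt wording, the function names (build_detection_prompt, parse_verdict, detection_accuracy), the toy documents, and the always_no stub model are all assumptions introduced for illustration. Any real LLM client could be passed in place of the stub as a function from prompt string to response string.

```python
from typing import Callable, List, Sequence, Tuple

# One evaluation example: the retrieved documents and whether they conflict.
Example = Tuple[Sequence[str], bool]


def build_detection_prompt(documents: Sequence[str], chain_of_thought: bool = False) -> str:
    """Format retrieved documents into a yes/no conflict-detection prompt.

    The wording is hypothetical; the paper's actual prompts are not shown here.
    """
    numbered = "\n".join(f"Document {i}: {doc}" for i, doc in enumerate(documents, start=1))
    instruction = "Do any of the documents above contradict themselves or each other?"
    if chain_of_thought:
        # Optional chain-of-thought variant of the same prompt.
        instruction += " Think step by step before giving your verdict."
    return f"{numbered}\n\n{instruction}\nEnd with 'Answer: yes' or 'Answer: no'."


def parse_verdict(response: str) -> bool:
    """Return True if the last 'Answer:' line of the response says yes."""
    for line in reversed(response.strip().splitlines()):
        lowered = line.strip().lower()
        if lowered.startswith("answer:"):
            return lowered.split(":", 1)[1].strip().startswith("yes")
    return False  # a response with no parsable verdict is treated as "no conflict"


def detection_accuracy(
    model: Callable[[str], str],
    examples: List[Example],
    chain_of_thought: bool = False,
) -> float:
    """Fraction of examples where the model's verdict matches the label."""
    correct = 0
    for documents, has_conflict in examples:
        prompt = build_detection_prompt(documents, chain_of_thought)
        correct += parse_verdict(model(prompt)) == has_conflict
    return correct / len(examples)


def always_no(prompt: str) -> str:
    """Stub standing in for an LLM call; it never reports a conflict."""
    return "Answer: no"


# Made-up toy data, for illustration only (not taken from the paper's dataset).
TOY_EXAMPLES: List[Example] = [
    (["The bridge opened in 1932.", "The bridge opened in 1937."], True),
    (["The bridge opened in 1932.", "The bridge spans the harbour."], False),
]

if __name__ == "__main__":
    print(build_detection_prompt(TOY_EXAMPLES[0][0], chain_of_thought=True))
    print(detection_accuracy(always_no, TOY_EXAMPLES))
```

Keeping the model behind a plain callable means the same harness can compare different models and prompting strategies (with and without the chain-of-thought flag) without changing the scoring code.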

Paper Structure

This paper contains 13 sections, 1 equation, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: Different types of contradictions in the retrieved documents.
  • Figure 2: Analysis of contradiction detection performance: (a) comparison across different contradiction types (self, pair, and conditional) and (b) effect of statement importance (most vs. least) on detection accuracy across different models and prompting strategies.
  • Figure 3: Analysis of positioning and evidence length effects: (a) performance comparison between near and far document positioning, and (b) impact of conflicting evidence length on detection accuracy across different models and prompting strategies.