MAGIC: A Multi-Hop and Graph-Based Benchmark for Inter-Context Conflicts in Retrieval-Augmented Generation

Jungyeon Lee, Kangmin Lee, Taeuk Kim

TL;DR

This work introduces MAGIC, a knowledge-graph–based benchmark for inter-context knowledge conflicts in retrieval-augmented generation. It combines subgraph extraction from Wikidata5M, systematic conflict generation via few-shot LLM prompting, and KG-to-text conversion to produce diverse, multi-hop scenarios with clear relational structure. Experiments across open-source and proprietary LLMs reveal that conflict detection remains challenging, especially for multi-hop cases, and localization of contradictions is often imperfect, motivating further work on reasoning, prompting strategies, and integration of diverse knowledge sources. The dataset and analyses provide a foundation for improving LLMs’ ability to reason across conflicting information, with implications for safer, more reliable RAG systems in real-world applications.

Abstract

Knowledge conflict often arises in retrieval-augmented generation (RAG) systems, where retrieved documents may be inconsistent with one another or contradict the model's parametric knowledge. Existing benchmarks for investigating the phenomenon have notable limitations, including a narrow focus on the question answering setup, heavy reliance on entity substitution techniques, and a restricted range of conflict types. To address these issues, we propose a knowledge graph (KG)-based framework that generates varied and subtle conflicts between two similar yet distinct contexts, while ensuring interpretability through the explicit relational structure of KGs. Experimental results on our benchmark, MAGIC, provide intriguing insights into the inner workings of LLMs regarding knowledge conflict: both open-source and proprietary models struggle with conflict detection -- especially when multi-hop reasoning is required -- and often fail to pinpoint the exact source of contradictions. Finally, we present in-depth analyses that serve as a foundation for improving LLMs in integrating diverse, sometimes even conflicting, information.
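The three stages named above (subgraph extraction, conflict generation, KG-to-text conversion) can be illustrated with a minimal sketch. It is not the authors' implementation. The toy triples, the function names, and the template verbalizer are all hypothetical. The conflict step here is a hand-written substitution that stands in for the few-shot LLM prompting the paper describes, and the real framework draws subgraphs from Wikidata5M rather than from a hard-coded list.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

# Toy knowledge graph; entities and relations are invented for illustration.
TOY_KG: List[Triple] = [
    ("Song A", "performer", "Band X"),
    ("Band X", "record label", "Label Y"),
    ("Label Y", "headquarters location", "City Z"),
    ("Song B", "performer", "Band Q"),
]


def extract_path(kg: List[Triple], seed: str, hops: int) -> List[Triple]:
    """Step 1 (stand-in): follow outgoing edges from `seed` for up to `hops` hops."""
    path: List[Triple] = []
    current = seed
    for _ in range(hops):
        # Take the first edge whose head is the current entity, if any.
        edge = next((t for t in kg if t[0] == current), None)
        if edge is None:
            break
        path.append(edge)
        current = edge[2]
    return path


def make_conflict(
    path: List[Triple], index: int, generator: Callable[[Triple], Triple]
) -> List[Triple]:
    """Step 2 (stand-in): copy the subgraph and rewrite one triple.

    `generator` is a placeholder for the LLM-based conflict generator.
    """
    perturbed = list(path)
    perturbed[index] = generator(path[index])
    return perturbed


def verbalize(triples: List[Triple]) -> str:
    """Step 3 (stand-in): template-based KG-to-text conversion."""
    return " ".join(f"{h} has {r} {t}." for h, r, t in triples)


# Build two similar contexts that disagree only at the third hop.
path = extract_path(TOY_KG, "Song A", hops=3)
conflicting = make_conflict(path, 2, lambda t: (t[0], t[1], "City W"))
context_a = verbalize(path)
context_b = verbalize(conflicting)
print(context_a)
print(context_b)
```

Because the perturbed triple sits at the end of a three-edge chain, noticing the disagreement requires reading both contexts along the whole path, which is the multi-hop setting the abstract identifies as hardest for current models.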

Paper Structure

This paper contains 55 sections, 22 figures, 17 tables.

Figures (22)

  • Figure 1: Example of a three-hop conflict from our benchmark, MAGIC. Even advanced LLMs struggle to detect subtle inconsistencies across two contexts, such as conflicting release orders of two songs.
  • Figure 2: Overview of the proposed KG-based framework for benchmarking inter-context knowledge conflict detection. It comprises three steps: (1) Subgraph Extraction, (2) Knowledge Conflict Generation, and (3) KG-to-Text Conversion, with details given in the section on MAGIC.
  • Figure 3: Distribution of conflict types across three knowledge conflict detection datasets. MAGIC demonstrates greater diversity and complexity than the others.
  • Figure 4: Four distinct types of conflicts in MAGIC.
  • Figure 5: Prompt for generating multi-hop conflicts.
  • ...and 17 more figures