MAGIC: A Multi-Hop and Graph-Based Benchmark for Inter-Context Conflicts in Retrieval-Augmented Generation
Jungyeon Lee, Kangmin Lee, Taeuk Kim
TL;DR
This work introduces MAGIC, a knowledge-graph-based benchmark for inter-context knowledge conflicts in retrieval-augmented generation. It combines subgraph extraction from Wikidata5M, systematic conflict generation via few-shot LLM prompting, and KG-to-text conversion to produce diverse, multi-hop scenarios with clear relational structure. Experiments across open-source and proprietary LLMs show that conflict detection remains challenging, especially in multi-hop cases, and that models often fail to localize the contradiction. These findings motivate further work on reasoning, prompting strategies, and the integration of diverse knowledge sources. The dataset and analyses provide a foundation for improving LLMs' ability to reason across conflicting information, which matters for building more reliable RAG systems.
Abstract
Knowledge conflicts often arise in retrieval-augmented generation (RAG) systems, where retrieved documents may be inconsistent with one another or contradict the model's parametric knowledge. Existing benchmarks for investigating this phenomenon have notable limitations, including a narrow focus on the question answering setup, heavy reliance on entity substitution techniques, and a restricted range of conflict types. To address these issues, we propose a knowledge graph (KG)-based framework that generates varied and subtle conflicts between two similar yet distinct contexts, while ensuring interpretability through the explicit relational structure of KGs. Experimental results on our benchmark, MAGIC, provide intriguing insights into the inner workings of LLMs regarding knowledge conflict: both open-source and proprietary models struggle with conflict detection, especially when multi-hop reasoning is required, and often fail to pinpoint the exact source of contradictions. Finally, we present in-depth analyses that serve as a foundation for improving how LLMs integrate diverse, sometimes even conflicting, information.
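
To make the idea of two "similar yet distinct" contexts concrete, the following toy sketch builds a two-hop subgraph, derives a conflicting copy, verbalizes both, and locates the contradiction through the triple structure. It is only an illustration under stated assumptions, not the MAGIC pipeline: the entities are invented, the `Triple` class and the helper names (`perturb`, `to_text`, `conflicting_triples`) are hypothetical, the conflict is produced by a hand-specified tail replacement rather than few-shot LLM prompting, and the verbalization is a naive template rather than the KG-to-text step described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str
    tail: str


def to_text(triples):
    """Naive template verbalization (assumption: stands in for KG-to-text)."""
    return " ".join(
        f"{t.head} {t.relation.replace('_', ' ')} {t.tail}." for t in triples
    )


def perturb(triples, index, new_tail):
    """Copy the subgraph and replace the tail of one triple.

    Assumption: a hand-picked replacement stands in for LLM-generated conflicts.
    """
    out = list(triples)
    old = out[index]
    out[index] = Triple(old.head, old.relation, new_tail)
    return out


def conflicting_triples(a, b):
    """Return pairs sharing (head, relation) but disagreeing on the tail.

    Assumption: each (head, relation) pair has a single valid tail.
    """
    by_key = {(t.head, t.relation): t for t in b}
    pairs = []
    for t in a:
        other = by_key.get((t.head, t.relation))
        if other is not None and other.tail != t.tail:
            pairs.append((t, other))
    return pairs


# Invented two-hop subgraph: Alice -> Acme -> Berlin.
context_a = [
    Triple("Alice", "works_at", "Acme"),
    Triple("Acme", "located_in", "Berlin"),
]
# The conflict sits on the second hop, so spotting it from the question
# "Where does Alice work?" requires following both triples.
context_b = perturb(context_a, index=1, new_tail="Paris")

conflicts = conflicting_triples(context_a, context_b)

print(to_text(context_a))
print(to_text(context_b))
for left, right in conflicts:
    print(f"Conflict on ({left.head}, {left.relation}): {left.tail} vs {right.tail}")
```

Because both contexts are derived from explicit triples, the gold location of the contradiction is known by construction, which is the kind of interpretability the relational structure is meant to provide.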
