MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems

Channdeth Sok; David Luz; Yacine Haddam

TL;DR

MetaRAG is a reference-free, black-box metamorphic testing framework for detecting hallucinations in Retrieval-Augmented Generation (RAG). It decomposes an answer into factoids, mutates each factoid with synonym and antonym substitutions, verifies every variant against the retrieved context, and aggregates the results into a response-level score $H(Q,A,C)$, defined as the maximum over the per-fact scores $S_i$, where $S_i$ is the average penalty across the $2N$ variants of factoid $i$. The authors implement a prototype and evaluate it on a proprietary enterprise dataset, using ablation and Pareto-front analyses to characterize the trade-off between accuracy and efficiency. MetaRAG also supports identity-aware safeguards: it localizes unsupported claims at the factoid level and exposes the flagged spans for policy-driven responses, escalation, and citations, which enables safer deployment in high-stakes settings.
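The aggregation described above can be written out as follows. This is a sketch under stated assumptions, not the paper's exact notation: it assumes the $2N$ variants of each factoid split into $N$ synonym and $N$ antonym variants, and the penalty symbols $p^{\text{syn}}_{i,j}$ and $p^{\text{ant}}_{i,j}$ (each in $[0,1]$) are introduced here for illustration only.

$$
S_i = \frac{1}{2N}\left(\sum_{j=1}^{N} p^{\text{syn}}_{i,j} + \sum_{j=1}^{N} p^{\text{ant}}_{i,j}\right),
\qquad
H(Q,A,C) = \max_i S_i
$$

For example, with $N = 2$, a factoid whose four variants receive penalties $0, 0, 1, 0$ has $S_i = 1/4 = 0.25$. If every other factoid in the answer scores $0$, then $H(Q,A,C) = 0.25$. Taking the maximum means that a single poorly supported factoid is enough to raise the score of the whole response.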

Abstract

Large Language Models (LLMs) are increasingly deployed in enterprise applications, yet their reliability remains limited by hallucinations, i.e., confident but factually incorrect information. Existing detection approaches, such as SelfCheckGPT and MetaQA, primarily target standalone LLMs and do not address the unique challenges of Retrieval-Augmented Generation (RAG) systems, where responses must be consistent with retrieved evidence. We therefore present MetaRAG, a metamorphic testing framework for hallucination detection in RAG systems. MetaRAG operates in a real-time, unsupervised, black-box setting, requiring neither ground-truth references nor access to model internals, making it suitable for proprietary and high-stakes domains. The framework proceeds in four stages: (1) decompose answers into atomic factoids, (2) generate controlled mutations of each factoid using synonym and antonym substitutions, (3) verify each variant against the retrieved context (synonyms are expected to be entailed and antonyms contradicted), and (4) aggregate penalties for inconsistencies into a response-level hallucination score. Crucially for identity-aware AI, MetaRAG localizes unsupported claims at the factoid span where they occur (e.g., pregnancy-specific precautions, LGBTQ+ refugee rights, or labor eligibility), allowing users to see flagged spans and enabling system designers to configure thresholds and guardrails for identity-sensitive queries. Experiments on a proprietary enterprise dataset illustrate the effectiveness of MetaRAG for detecting hallucinations and enabling trustworthy deployment of RAG-based conversational agents. We also outline a topic-based deployment design that translates MetaRAG's span-level scores into identity-aware safeguards; this design is discussed but not evaluated in our experiments.
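The four stages can be sketched in Python as a small pipeline with pluggable components. This is an illustrative sketch, not the authors' implementation: the function names, the verdict labels, and the penalty values (0 for the expected verdict, 0.5 for "unsure", 1 otherwise) are assumptions made here. In a real system the `decompose`, `mutate`, and `verify` callables would be backed by an LLM or an entailment model; the `toy_*` stand-ins below use plain string matching only so that the example runs on its own.

```python
from typing import Callable, List, Tuple

# Hypothetical verdict labels for a (variant, context) check.
ENTAILED, CONTRADICTED, UNSURE = "entailed", "contradicted", "unsure"


def penalty(expected: str, verdict: str) -> float:
    """Assumed penalty scheme: 0 if the verdict is the expected one,
    0.5 if the verifier is unsure, 1 otherwise."""
    if verdict == expected:
        return 0.0
    return 0.5 if verdict == UNSURE else 1.0


def fact_score(syn_verdicts: List[str], ant_verdicts: List[str]) -> float:
    """Stage 4a: average penalty over all variants of one factoid.
    Synonyms should be entailed, antonyms should be contradicted."""
    penalties = [penalty(ENTAILED, v) for v in syn_verdicts]
    penalties += [penalty(CONTRADICTED, v) for v in ant_verdicts]
    if not penalties:
        raise ValueError("a factoid needs at least one variant to be scored")
    return sum(penalties) / len(penalties)


def metarag_score(
    answer: str,
    context: str,
    decompose: Callable[[str], List[str]],
    mutate: Callable[[str], Tuple[List[str], List[str]]],
    verify: Callable[[str, str], str],
) -> Tuple[float, List[Tuple[str, float]]]:
    """Run the four stages and return (response score, per-factoid scores)."""
    per_fact = []
    for fact in decompose(answer):                      # stage 1
        synonyms, antonyms = mutate(fact)               # stage 2
        syn_v = [verify(s, context) for s in synonyms]  # stage 3
        ant_v = [verify(a, context) for a in antonyms]
        per_fact.append((fact, fact_score(syn_v, ant_v)))
    # Stage 4b: the response score is the worst (maximum) factoid score.
    overall = max((score for _, score in per_fact), default=0.0)
    return overall, per_fact


# --- Toy stand-ins (string matching only, for demonstration) ---
def toy_decompose(answer: str) -> List[str]:
    return [s.strip() for s in answer.split(".") if s.strip()]


def toy_mutate(fact: str) -> Tuple[List[str], List[str]]:
    # One "synonym" (the fact itself) and one "antonym" (a negation marker).
    return [fact], ["NOT " + fact]


def toy_verify(variant: str, context: str) -> str:
    if variant.startswith("NOT "):
        return CONTRADICTED if variant[4:] in context else UNSURE
    return ENTAILED if variant in context else UNSURE


context = "Paris is the capital of France. It lies on the Seine"
answer = "Paris is the capital of France. It has ten million residents"
overall, per_fact = metarag_score(
    answer, context, toy_decompose, toy_mutate, toy_verify
)
print(overall)   # 0.5
print(per_fact)  # the second factoid is the one flagged
```

Because the per-factoid scores are returned alongside the response-level score, a caller can show exactly which factoid was flagged and apply its own threshold, which is the localization property the abstract describes.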
