MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems

Channdeth Sok, David Luz, Yacine Haddam

TL;DR

MetaRAG tackles hallucinations in Retrieval-Augmented Generation by offering a reference-free, black-box metamorphic testing framework. It decomposes answers into factoids, mutates them with synonyms and antonyms, verifies each variant against retrieved context, and aggregates a response-level score $H(Q,A,C)$ defined as the maximum over per-factoid scores $S_i$, where $S_i$ is the average penalty across $2N$ variants. The authors implement a prototype and evaluate it on a proprietary enterprise dataset, performing ablation and Pareto-front analyses to characterize trade-offs between accuracy and efficiency. Importantly, MetaRAG supports identity-aware safeguards by localizing unsupported claims at the factoid level and exposing spans for policy-driven responses, escalation, and citations, enabling safer deployment in high-stakes settings.
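
Written out, the aggregation described above might take the following form. The per-variant penalty symbol $p_{i,j}$ and the factoid count $m$ are notational assumptions introduced here for illustration; the summary itself states only that $S_i$ averages penalties over $2N$ variants (presumably $N$ synonym and $N$ antonym mutations) and that $H$ is the maximum over factoids.

$$
S_i = \frac{1}{2N} \sum_{j=1}^{2N} p_{i,j}, \qquad H(Q, A, C) = \max_{1 \le i \le m} S_i
$$

Because $H$ is a maximum and not a mean, a single poorly supported factoid is enough to raise the response-level score, which is consistent with the goal of localizing unsupported claims at the factoid level.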

Abstract

Large Language Models (LLMs) are increasingly deployed in enterprise applications, yet their reliability remains limited by hallucinations, i.e., confident but factually incorrect information. Existing detection approaches, such as SelfCheckGPT and MetaQA, primarily target standalone LLMs and do not address the unique challenges of Retrieval-Augmented Generation (RAG) systems, where responses must be consistent with retrieved evidence. We therefore present MetaRAG, a metamorphic testing framework for hallucination detection in RAG systems. MetaRAG operates in a real-time, unsupervised, black-box setting, requiring neither ground-truth references nor access to model internals, making it suitable for proprietary and high-stakes domains. The framework proceeds in four stages: (1) decompose answers into atomic factoids, (2) generate controlled mutations of each factoid using synonym and antonym substitutions, (3) verify each variant against the retrieved context (synonyms are expected to be entailed and antonyms contradicted), and (4) aggregate penalties for inconsistencies into a response-level hallucination score. Crucially for identity-aware AI, MetaRAG localizes unsupported claims at the factoid span where they occur (e.g., pregnancy-specific precautions, LGBTQ+ refugee rights, or labor eligibility), allowing users to see flagged spans and enabling system designers to configure thresholds and guardrails for identity-sensitive queries. Experiments on a proprietary enterprise dataset illustrate the effectiveness of MetaRAG for detecting hallucinations and enabling trustworthy deployment of RAG-based conversational agents. We also outline a topic-based deployment design that translates MetaRAG's span-level scores into identity-aware safeguards; this design is discussed but not evaluated in our experiments.
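
Stages (3) and (4) above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the verdict labels (`entailed`, `unsure`, `contradicted`), the penalty values in `PENALTY`, and the function names are all hypothetical. The only parts taken from the text are that synonym variants are expected to be entailed, antonym variants are expected to be contradicted, per-factoid scores average the penalties, and the response-level score is the maximum over factoids.

```python
from statistics import mean

# Hypothetical penalty table: (variant kind, verifier verdict) -> penalty.
# The specific values are an assumption made for illustration only.
PENALTY = {
    ("synonym", "entailed"): 0.0,      # expected outcome, no penalty
    ("synonym", "unsure"): 0.5,
    ("synonym", "contradicted"): 1.0,  # synonym should have been entailed
    ("antonym", "contradicted"): 0.0,  # expected outcome, no penalty
    ("antonym", "unsure"): 0.5,
    ("antonym", "entailed"): 1.0,      # antonym should have been contradicted
}


def factoid_score(verdicts):
    """Average penalty over all (kind, verdict) pairs of one factoid."""
    return mean(PENALTY[v] for v in verdicts)


def hallucination_score(per_factoid_verdicts):
    """Return (response-level score, per-factoid scores).

    The response-level score is the maximum per-factoid score, so the
    index of the maximum also points at the least supported factoid.
    """
    scores = [factoid_score(v) for v in per_factoid_verdicts]
    return max(scores), scores


# Made-up verdicts for two factoids with N = 2 synonyms and 2 antonyms each.
example = [
    [("synonym", "entailed"), ("synonym", "entailed"),
     ("antonym", "contradicted"), ("antonym", "contradicted")],
    [("synonym", "contradicted"), ("synonym", "unsure"),
     ("antonym", "entailed"), ("antonym", "unsure")],
]
h, scores = hallucination_score(example)
print(h, scores)
```

In this made-up example the first factoid behaves as expected on every variant, while the second fails on most of them, so the second factoid alone determines the response-level score. A deployment would compare that score against a configurable threshold, as the abstract suggests for identity-sensitive queries.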

Paper Structure

This paper contains 27 sections, 3 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Standard Retrieval-Augmented Generation (RAG) workflow. A user query is encoded into a vector representation using an embedding model and queried against a vector database constructed from a document corpus. The most relevant document chunks are retrieved and appended to the original query, which is then provided as input to a large language model (LLM) to generate the final response.
  • Figure 2: Overview of the MetaRAG workflow. (A) Integration of MetaRAG into a standard RAG pipeline: given a user question, the RAG retrieves context and generates an answer, which is then passed to MetaRAG for hallucination detection. (B) Internal MetaRAG pipeline: the answer is decomposed into atomic factoids, each factoid is mutated through synonym and antonym substitutions, and verified against the retrieved context using entailment/contradiction checks. Penalties are assigned to inconsistencies, and scores are aggregated into a response-level hallucination score.
  • Figure 3: Evaluation metrics for all 26 MetaRAG configurations.
  • Figure 4: Pareto front analysis for hallucination detection performance. Each point represents a MetaRAG configuration; Pareto-optimal points (non-dominated) are highlighted. Subplots show: (Left) F1 vs. average token usage, (Center) F1 vs. average total execution time, (Right) Precision vs. Recall. Pareto-optimal points represent configurations with no strictly better alternative in both accuracy and cost. Configuration IDs correspond to those in the paper's leaderboard table.
  • Figure 5: Token length distributions of generated answers (left) and retrieved context passages (right).
  • ...and 2 more figures
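
The Pareto-front selection described in the Figure 4 caption can be illustrated with a short sketch. It assumes the usual dominance rule (a configuration is dominated if another one is at least as good on both axes and strictly better on at least one), with higher F1 and lower cost being better. The configuration IDs and numbers below are made up for illustration and are not results from the paper.

```python
def pareto_front(points):
    """Return the IDs of non-dominated configurations.

    points: list of (config_id, f1, cost) tuples, where higher f1 is
    better and lower cost (e.g. tokens or seconds) is better.
    """
    front = []
    for cid, f1, cost in points:
        dominated = any(
            # The other point is no worse on both axes...
            (f2 >= f1 and c2 <= cost)
            # ...and strictly better on at least one of them.
            and (f2 > f1 or c2 < cost)
            for _, f2, c2 in points
        )
        if not dominated:
            front.append(cid)
    return front


# Hypothetical configurations: (id, F1, average token usage).
configs = [
    ("a", 0.9, 100),
    ("b", 0.8, 50),
    ("c", 0.7, 80),   # dominated by "b": lower F1 and higher cost
    ("d", 0.9, 120),  # dominated by "a": same F1 but higher cost
]
print(pareto_front(configs))
```

The same function applies to the precision-versus-recall panel if one of the two metrics is negated before being passed in as `cost`, since both are then to be maximized.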