MMD-Flagger: Leveraging Maximum Mean Discrepancy to Detect Hallucinations
Kensuke Mitsuzawa, Damien Garreau
TL;DR
This paper tackles the challenge of hallucinations in large language models by proposing MMD-Flagger, a white-box detector that does not rely on external models. The method generates a hypothesis and multiple stochastic samples across a grid of temperatures, then uses Maximum Mean Discrepancy to measure distributional distances between the hypothesis and samples, tracking an MMD trajectory to detect hallucinations. Empirical results on machine translation and summarization benchmarks show that MMD-Flagger achieves competitive recall and precision across datasets and kernel configurations, with robustness to vectorization choices. The approach offers a practical, self-contained tool for hallucination detection in critical applications, and points toward interpretability enhancements via kernel-based feature importance analyses.
Abstract
Large language models (LLMs) have become pervasive in our everyday lives. Yet a fundamental obstacle prevents their use in many critical applications: their propensity to generate fluent, human-quality content that is not grounded in reality. The detection of such hallucinations is thus of the highest importance. In this work, we propose a new method to flag hallucinated content: MMD-Flagger. It relies on Maximum Mean Discrepancy (MMD), a non-parametric distance between distributions. At a high level, MMD-Flagger tracks the MMD between the output under inspection and counterparts generated with various temperature parameters. We show empirically that inspecting the shape of this trajectory is sufficient to detect most hallucinations. This novel method is benchmarked on machine translation and summarization datasets, on which it exhibits competitive performance relative to natural competitors.
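To make the notion of an MMD trajectory concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that the output under inspection and the samples drawn at each temperature have already been turned into arrays of vectors (one row per vector); how these vectors are obtained is not specified here. The Gaussian kernel, the fixed bandwidth, the biased estimator, and the function names `gaussian_kernel`, `mmd2_biased`, and `mmd_trajectory` are all illustrative assumptions. The sketch only computes the trajectory; the rule used to flag a hallucination from its shape is not reproduced.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of x and the rows of y.

    The kernel choice and bandwidth are assumptions made for illustration.
    """
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * (x @ y.T)
    )
    # Clip tiny negative values caused by floating-point cancellation.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    return float(k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean())


def mmd_trajectory(reference, samples_by_temperature, bandwidth=1.0):
    """Squared-MMD estimate between the reference and each temperature's samples.

    `reference` is an (n, d) array for the output under inspection.
    `samples_by_temperature` maps a temperature to an (m, d) array.
    Returns a list of (temperature, mmd2) pairs sorted by temperature.
    """
    return [
        (t, mmd2_biased(reference, samples_by_temperature[t], bandwidth))
        for t in sorted(samples_by_temperature)
    ]
```

As a usage example, calling `mmd_trajectory(reference, {0.2: s_low, 1.0: s_high})` returns one squared-MMD estimate per temperature, in increasing temperature order. The resulting sequence is the kind of trajectory whose shape the method inspects.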
