Lyapunov Probes for Hallucination Detection in Large Foundation Models

Bozhi Luan, Gen Li, Yalan Qin, Jifeng Guo, Yun Zhou, Faguo Wu, Hongwei Zheng, Wenjun Wu, Zhaoxin Fan

TL;DR

This work proposes Lyapunov Probes: lightweight networks trained with derivative-based stability constraints that enforce a monotonic decay in confidence under input perturbations. The resulting probes reliably distinguish stable factual regions from unstable, hallucination-prone regions.

Abstract

We address hallucination detection in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) by framing the problem through the lens of dynamical systems stability theory. Rather than treating hallucination as a straightforward classification task, we conceptualize (M)LLMs as dynamical systems, where factual knowledge is represented by stable equilibrium points within the representation space. Our main insight is that hallucinations tend to arise at the boundaries of knowledge-transition regions separating stable and unstable zones. To capture this phenomenon, we propose Lyapunov Probes: lightweight networks trained with derivative-based stability constraints that enforce a monotonic decay in confidence under input perturbations. By performing systematic perturbation analysis and applying a two-stage training process, these probes reliably distinguish between stable factual regions and unstable, hallucination-prone regions. Experiments on diverse datasets and models demonstrate consistent improvements over existing baselines.
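
The monotonic-decay constraint described above can be illustrated with a minimal sketch. It is not the authors' training objective: it assumes the probe has already produced one confidence value per perturbation level (ordered from weakest to strongest) and uses a finite-difference hinge penalty as a stand-in for the derivative-based constraint. The function name `monotonic_decay_penalty` and the `margin` parameter are hypothetical.

```python
from typing import Sequence


def monotonic_decay_penalty(confidences: Sequence[float], margin: float = 0.0) -> float:
    """Hinge penalty on confidence that fails to decay as perturbation grows.

    `confidences` holds hypothetical probe outputs, ordered from the weakest
    to the strongest input perturbation. This is an illustrative assumption,
    not the loss used in the paper.
    """
    penalty = 0.0
    for prev, nxt in zip(confidences, confidences[1:]):
        # Finite-difference proxy for requiring d(confidence)/d(perturbation) <= -margin:
        # any step where confidence does not drop by at least `margin` is penalized.
        penalty += max(0.0, nxt - prev + margin)
    return penalty


# Confidence that decays at every step incurs no penalty.
decaying = [0.9, 0.7, 0.4, 0.2]
# Confidence that rises at some steps is penalized in proportion to the rises.
erratic = [0.9, 0.95, 0.4, 0.6]

print(monotonic_decay_penalty(decaying))
print(monotonic_decay_penalty(erratic))
```

In a trainable setting, a term of this kind would be added to the classification loss so that the probe is pushed toward outputs that decrease as the perturbation strength increases.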

Paper Structure

This paper contains 16 sections, 5 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Illustration of the representation space partition in large models. We divide the data (representation) space into three regions: (a) stable known regions, (b) stable unknown regions, and (c) unstable knowledge boundary regions. Hallucinations primarily emerge in the unstable boundary regions.
  • Figure 2: Overview of our Lyapunov Probing framework. Multi-layer hidden states and perturbation information are fed into a probe network comprising a transformer-based HiddenProcessor and an MLP-based Classifier. The framework is trained with stability-driven objectives that enforce monotonic confidence decay, enabling distinction between stable factual regions and unstable hallucination-prone regions.
  • Figure 3: Case analysis of our Lyapunov Probe. The probe score provides an accurate prediction of whether the model can answer correctly, enabling the system to either proceed with or abstain from generating a response, thereby effectively reducing hallucinations.
  • Figure 4: Verification of the Lyapunov property across 4 models. Compared with previous probes, the outputs of our Lyapunov probe decrease monotonically with increasing perturbations, enabling more distinct differentiation of hallucinations.
  • Figure 5: Performance comparison among layers from different parts of the model. Single-layer performance fluctuates significantly from layer to layer, while our proposed multi-layer fusion method consistently achieves the highest performance.
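
The proceed-or-abstain behavior described in the Figure 3 caption can be sketched as a simple gate on the probe score. This is only an illustration under stated assumptions: it treats a higher score as meaning the model is more likely to answer correctly, and the function name `gate_response` and the `threshold` value of 0.5 are hypothetical rather than taken from the paper.

```python
from typing import Optional


def gate_response(probe_score: float, draft_answer: str, threshold: float = 0.5) -> Optional[str]:
    """Return the draft answer if the probe score clears the threshold, else abstain.

    Assumption: higher `probe_score` means the model is more likely to be
    correct. The default threshold is a placeholder, not a reported value.
    """
    if probe_score >= threshold:
        return draft_answer
    # Returning None signals abstention instead of a possibly hallucinated answer.
    return None


print(gate_response(0.9, "Paris"))
print(gate_response(0.1, "Paris"))
```

In practice the threshold would be tuned on held-out data to trade off answer coverage against the hallucination rate.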