TraceDet: Hallucination Detection from the Decoding Trace of Diffusion Large Language Models

Shenxu Chang, Junchi Yu, Weixing Wang, Yongqiang Chen, Jialin Yu, Philip Torr, Jindong Gu

TL;DR

TraceDet addresses hallucination detection in diffusion LLMs by leveraging intermediate denoising steps as an action trace and applying an information bottleneck to identify a minimal, informative sub-trace. It combines a Transformer-based sub-trace extractor with a classifier trained on the extracted trace, optimized via a practical objective that pairs a cross-entropy loss with a trace-extraction regularizer. Experiments on open-source D-LLMs across multiple QA benchmarks show substantial AUROC gains (average around 15.2%) over strong baselines, with robustness to generation length, step length, and remasking strategies. The work advances safe deployment of diffusion-based LLMs by providing a scalable, decoding-trace–driven detection framework and sheds light on the multi-step dynamics underlying hallucinations.

Abstract

Diffusion large language models (D-LLMs) have recently emerged as a promising alternative to auto-regressive LLMs (AR-LLMs). However, the hallucination problem in D-LLMs remains underexplored, limiting their reliability in real-world applications. Existing hallucination detection methods are designed for AR-LLMs and rely on signals from single-step generation, making them ill-suited for D-LLMs, where hallucination signals often emerge throughout the multi-step denoising process. To bridge this gap, we propose TraceDet, a novel framework that explicitly leverages the intermediate denoising steps of D-LLMs for hallucination detection. TraceDet models the denoising process as an action trace, with each action defined as the model's prediction over the cleaned response, conditioned on the previous intermediate output. By identifying the sub-trace that is maximally informative about hallucinated responses, TraceDet exploits the key hallucination signals in the multi-step denoising process of D-LLMs. Extensive experiments on various open-source D-LLMs demonstrate that TraceDet consistently improves hallucination detection, achieving an average gain in AUROC of 15.2% compared to baselines.
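
The TL;DR above describes an objective that pairs a cross-entropy loss with a trace-extraction regularizer. The exact form of that regularizer is not given in this summary, so the following is only a minimal sketch under stated assumptions: the regularizer is taken to be a squared penalty pulling the mean keep-probability of the temporal mask toward a target ratio `tau`, weighted by `beta`. The function names and this penalty form are hypothetical illustrations, not the authors' implementation.

```python
import math

def binary_cross_entropy(p_hallucinated, label):
    """Cross-entropy for a binary hallucination label (1 = hallucinated)."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(p_hallucinated, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def mask_ratio_penalty(mask_probs, tau):
    """ASSUMED regularizer: squared gap between the mean keep-probability
    of the temporal mask and a target masking ratio tau."""
    mean_keep = sum(mask_probs) / len(mask_probs)
    return (mean_keep - tau) ** 2

def total_loss(p_hallucinated, label, mask_probs, tau=0.5, beta=1.0):
    """Classification loss plus the beta-weighted (hypothetical) mask penalty."""
    ce = binary_cross_entropy(p_hallucinated, label)
    return ce + beta * mask_ratio_penalty(mask_probs, tau)

# Toy usage: one response, four denoising steps with soft keep-probabilities.
loss = total_loss(p_hallucinated=0.8, label=1,
                  mask_probs=[0.9, 0.1, 0.7, 0.3], tau=0.5, beta=1.0)
```

In this sketch, a larger `beta` pushes the extractor toward keeping roughly a `tau` fraction of steps, while the cross-entropy term rewards keeping the steps that help the classifier.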

Paper Structure

This paper contains 23 sections, 9 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustration of representative D-LLM hallucination patterns extracted by TraceDet. Left: Interleaving Hallucination, where the model decodes both truthful and hallucinated content. Middle: Inconsistent Guesses, where multiple contradictory keywords lead to hallucination. Right: Persistent Error, where the model maintains a hallucinated answer throughout denoising. Hallucinations are highlighted in red.
  • Figure 2: Illustration of TraceDet. During denoising, a diffusion LLM generates intermediate sequences along with token-level entropy traces, where highlighted words indicate the retained tokens after remasking (left). The sub-instance extractor $g_\theta$ produces a temporal mask $M$ to focus on informative steps, and the predictor $f_\phi$ classifies whether the final response is hallucinated (right).
  • Figure 3: Comparison of averaged trace entropy selected by different model variants. No masking refers to the full time step traces. (a) Comparison between different model variants. (b) Comparison between different masking ratios $\tau$. Results are reported using Dream-7B-Instruct.
  • Figure 4: (a) TraceDet performance at different generation lengths with step length fixed at 1. (b) TraceDet performance at different step lengths with generation length fixed at 128. All results are reported as AUROC using Dream-7B-Instruct.
  • Figure 5: (a) TraceDet performance sensitivity to remasking strategies. (b) TraceDet performance sensitivity to $\mathcal{L}_{ext}$ parameters $\tau$ and $\beta$ on TriviaQA. All results are reported as AUROC using Dream-7B-Instruct.
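
The captions for Figures 2 and 3 refer to token-level entropy traces, a temporal mask $M$ over denoising steps, and the averaged trace entropy of the selected steps. The sketch below shows one way these quantities could be computed on toy data. It assumes the entropy is the Shannon entropy of each token's predictive distribution and that $M$ is a binary keep/drop mask per step; the function names and toy inputs are hypothetical and do not come from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_trace(step_distributions):
    """Build the trace: one list of per-token entropies per denoising step.

    step_distributions[t][i] is the distribution over the vocabulary
    for token position i at denoising step t.
    """
    return [[token_entropy(p) for p in step] for step in step_distributions]

def masked_step_average(trace, mask):
    """Average the per-step mean entropy over steps kept by a binary mask."""
    kept = [sum(step) / len(step) for step, m in zip(trace, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0

# Toy example: 3 denoising steps, 2 token positions, vocabulary of size 2.
steps = [
    [[0.5, 0.5], [0.5, 0.5]],  # step 0: both tokens maximally uncertain
    [[0.9, 0.1], [0.5, 0.5]],  # step 1: first token sharpening
    [[1.0, 0.0], [1.0, 0.0]],  # step 2: both tokens fully confident
]
trace = entropy_trace(steps)
mask = [1, 1, 0]  # hypothetical temporal mask M keeping the first two steps
avg = masked_step_average(trace, mask)
```

Passing an all-ones mask corresponds to the "no masking" setting in Figure 3, where the full trace is averaged.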