Efficient Self-Evaluation for Diffusion Language Models via Sequence Regeneration

Linhao Zhong, Linyu Wu, Wen Wang, Yuling Xi, Chenchen Jing, Jiaheng Zhang, Hao Chen, Chunhua Shen

TL;DR

This work proposes DiSE, a simple yet effective method for quantifying self-evaluation confidence in dLLMs. By leveraging token regeneration probabilities, it enables more efficient and reliable quality assessment and supports both likelihood estimation and robust uncertainty quantification.

Abstract

Diffusion large language models (dLLMs) have recently attracted significant attention for their ability to enhance diversity, controllability, and parallelism. However, their non-sequential, bidirectionally masked generation makes quality assessment difficult, underscoring the need for effective self-evaluation. In this work, we propose DiSE, a simple yet effective method for quantifying self-evaluation confidence in dLLMs. DiSE quantifies confidence by computing the probability of regenerating the tokens of the entire generated sequence, given the full context. By leveraging these token regeneration probabilities, DiSE enables more efficient and reliable quality assessment and supports both likelihood estimation and robust uncertainty quantification. Building upon DiSE, we further introduce a flexible-length generation framework that adaptively controls the sequence length based on the model's self-assessment of its own output. We analyze and validate the feasibility of DiSE from the perspective of dLLM generalization, and empirically demonstrate that DiSE scores are positively correlated with both semantic coherence and answer accuracy. Extensive experiments on likelihood evaluation, uncertainty quantification, and flexible-length generation further confirm the effectiveness of DiSE.
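
The core idea in the abstract, scoring a sequence by the probability of regenerating its own tokens given the full context, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `toy_forward` is a hypothetical stand-in for a dLLM forward pass (a real model would return logits), `VOCAB` and `PAIRS` are invented toy data, and averaging log-probabilities is one plausible aggregation rather than the paper's confirmed formula.

```python
import math

# Toy vocabulary and "plausible adjacent pairs"; both are illustrative assumptions.
VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]
PAIRS = {("the", "cat"), ("cat", "sat"), ("sat", "on"), ("on", "the"),
         ("the", "mat"), ("the", "dog"), ("dog", "sat")}


def toy_forward(tokens):
    """Hypothetical stand-in for one dLLM forward pass on the unmasked sequence.

    Returns, for every position, a probability distribution over VOCAB that
    depends on both the left and the right neighbor (bidirectional context).
    """
    dists = []
    for i in range(len(tokens)):
        left = tokens[i - 1] if i > 0 else None
        right = tokens[i + 1] if i + 1 < len(tokens) else None
        scores = {}
        for v in VOCAB:
            s = 1.0
            if (left, v) in PAIRS:
                s += 3.0  # candidate fits the left neighbor
            if (v, right) in PAIRS:
                s += 3.0  # candidate fits the right neighbor
            scores[v] = s
        total = sum(scores.values())
        dists.append({v: s / total for v, s in scores.items()})
    return dists


def dise_score(tokens, positions=None):
    """Mean log-probability of regenerating tokens[i] at the selected positions.

    Averaging log-probabilities is an assumption of this sketch.
    """
    if positions is None:
        positions = range(len(tokens))
    positions = list(positions)
    dists = toy_forward(tokens)  # a single forward pass over the full sequence
    return sum(math.log(dists[i][tokens[i]]) for i in positions) / len(positions)


natural = ["the", "cat", "sat", "on", "the", "mat"]
shuffled = ["mat", "the", "on", "cat", "the", "sat"]

print(dise_score(natural))                    # score over the full sequence
print(dise_score(shuffled))                   # same tokens in a scrambled order
print(dise_score(natural, positions=[0, 1]))  # score over a subset of positions
```

In this toy setup the natural word order receives a higher score than the scrambled one, which mirrors the qualitative claim that the score tracks semantic coherence. The `positions` argument is a hypothetical way to restrict scoring to part of the sequence.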

Paper Structure (38 sections, 5 equations, 17 figures, 10 tables)

Figures (17)

  • Figure 1: A simplified illustration of two methods for quantifying self-evaluation confidence. (a) Monte Carlo simulation approach for dLLMs. A total of $N_{mc}$ simulations are performed. In the $i$-th simulation, a set of masked positions $\{mask^i_j\}$ is sampled. The tokens at these positions are replaced with mask tokens, and the model predicts the probability of correctly generating these tokens. The final estimate is obtained by aggregating the results across all $N_{mc}$ simulations. (b) The proposed DiSE for dLLMs. The set of selected positions $\{U_j\}$ is predefined. The model receives the entire unmasked sequence and estimates the regeneration probability of the tokens at $\{U_j\}$.
  • Figure 2: Generalization ability of dLLMs: Different tokens map from distinct start points to similar end points in the latent space.
  • Figure 3: Histogram and cumulative distribution function (CDF) of the probability ranks of ground-truth (GT) tokens.
  • Figure 4: Mean pairwise distribution distances for GT, mask, and random tokens using JS Divergence and Wasserstein Distance.
  • Figure 5: Differences between the DiSE scores of natural sentences and randomized sentences using the LLaDA-Instruct-8B model under four selection modes: 'full' (entire sentence), 'first-10' (first 10 tokens), 'mid-10' (10 tokens from the middle) and 'last-10' (last 10 tokens). Each subfigure contains 15 blocks, representing 15 sampled sentences. All blocks are shown in green (difference $>0$), indicating that natural sentences consistently achieve higher DiSE scores than randomized sentences.
  • ...and 12 more figures
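
The contrast drawn in the Figure 1 caption can be sketched in code: the Monte Carlo approach needs one forward pass per simulation, while the single-pass approach scores predefined positions from one pass over the unmasked sequence. This is a rough sketch under stated assumptions: `CountingToyModel` is a hypothetical stand-in for a dLLM with an invented scoring rule, `mask_ratio` and `seed` are illustrative parameters, and plain averaging is assumed as the aggregation step, since the caption does not specify it.

```python
import math
import random

MASK = "<mask>"
VOCAB = ["a", "b", "c", "d"]


class CountingToyModel:
    """Hypothetical stand-in for a dLLM that counts its forward passes."""

    def __init__(self):
        self.passes = 0

    def __call__(self, tokens):
        self.passes += 1
        dists = []
        for i in range(len(tokens)):
            scores = {}
            for v in VOCAB:
                s = 1.0
                # Invented toy rule: a candidate is likelier if it matches a
                # visible neighbor. Masked neighbors never match a candidate.
                for j in (i - 1, i + 1):
                    if 0 <= j < len(tokens) and tokens[j] == v:
                        s += 2.0
                scores[v] = s
            total = sum(scores.values())
            dists.append({v: s / total for v, s in scores.items()})
        return dists


def mean_log_prob(dists, tokens, positions):
    """Average log-probability of the original tokens at the given positions."""
    return sum(math.log(dists[i][tokens[i]]) for i in positions) / len(positions)


def monte_carlo_estimate(model, tokens, n_mc, mask_ratio, seed=0):
    """Figure 1(a) style: sample masked positions, mask them, score, aggregate."""
    rng = random.Random(seed)
    k = max(1, round(mask_ratio * len(tokens)))
    total = 0.0
    for _ in range(n_mc):
        masked = rng.sample(range(len(tokens)), k)
        corrupted = [MASK if i in masked else t for i, t in enumerate(tokens)]
        total += mean_log_prob(model(corrupted), tokens, masked)
    return total / n_mc  # plain averaging is an assumption of this sketch


def single_pass_estimate(model, tokens, positions):
    """Figure 1(b) style: one pass over the full sequence, predefined positions."""
    return mean_log_prob(model(tokens), tokens, positions)


tokens = ["a", "a", "b", "b", "c", "c"]

mc_model = CountingToyModel()
mc_value = monte_carlo_estimate(mc_model, tokens, n_mc=8, mask_ratio=0.5)

sp_model = CountingToyModel()
sp_value = single_pass_estimate(sp_model, tokens, positions=range(len(tokens)))

print(mc_model.passes, sp_model.passes)  # forward passes used by each estimator
print(mc_value, sp_value)
```

By construction, the Monte Carlo path calls the model `n_mc` times and the single-pass path calls it once, which is the efficiency argument the caption makes. The two estimates are not expected to be numerically equal, because they condition on different inputs.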