Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding

Zhongxing Xu, Zhonghua Wang, Zhe Qian, Dachuan Shi, Feilong Tang, Ming Hu, Shiyan Su, Xiaocheng Zou, Wei Feng, Dwarikanath Mahapatra, Yifan Peng, Mingquan Lin, Zongyuan Ge

Abstract

Recent advancements in multimodal large reasoning models (MLRMs) have significantly improved performance in visual question answering. However, we observe that transition words (e.g., because, however, and wait) are closely associated with hallucinations and tend to exhibit high-entropy states. We argue that adequate contextual reasoning information can be directly extracted from the token probability distribution. Inspired by superposed representation theory, we propose leveraging latent superposed reasoning to integrate multiple candidate semantics and maintain latent reasoning trajectories. The hypothesis is that reliance on discrete textual inputs may drive the model toward sequential explicit reasoning, underutilizing dense contextual cues during high-entropy reasoning stages. Therefore, we propose constructing rich semantic representations from the token probability distributions to enhance in-context reasoning. With this goal, we present Latent Entropy-Aware Decoding (LEAD), an efficient plug-and-play decoding strategy that leverages semantic context to achieve reliable reasoning. The heart of our method lies in entropy-aware reasoning mode switching. The model employs probability-weighted continuous embeddings under high-entropy states and transitions back to discrete token embeddings as entropy decreases. Moreover, we propose a prior-guided visual anchor injection strategy that encourages the model to focus on visual information. Extensive experiments show that LEAD effectively mitigates hallucinations across various MLRMs on multiple benchmarks.
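
As a hedged illustration of the two quantities the abstract relies on, one common way to write them is given below. This is a sketch under assumptions, not necessarily the paper's own notation: $p_t$ denotes the next-token distribution at step $t$ over a vocabulary $\mathcal{V}$, $E_v$ is the input embedding of token $v$, and $\tilde{e}_t$ is a hypothetical symbol for the probability-weighted continuous embedding.

$$
H_t = -\sum_{v \in \mathcal{V}} p_t(v)\,\log p_t(v),
\qquad
\tilde{e}_t = \sum_{v \in \mathcal{V}} p_t(v)\,E_v
$$

Under this reading, a near-uniform $p_t$ yields a large $H_t$ and an embedding $\tilde{e}_t$ that blends many candidates, whereas a sharply peaked $p_t$ yields $H_t \approx 0$ and $\tilde{e}_t$ close to the embedding of the single most likely token.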

Paper Structure

This paper contains 31 sections, 9 equations, 10 figures, 3 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Illustrations of the correlation between hallucinations and transition words. In MLRMs, hallucinations tend to emerge more frequently after transition words, and these cases constitute a significant proportion of the overall hallucination occurrences.
  • Figure 2: Visualizations of token entropy during the reasoning phase show that tokens with higher entropy often correspond to transition words, consistent with our previous findings.
  • Figure 3: (a) Performance gap when masking different types of tokens during reasoning. Masking high-entropy tokens produces a larger performance drop than masking other tokens. (b) Token masking impact across reasoning steps. Earlier tokens tend to have a stronger influence on the final answer, while the influence of later ones gradually diminishes. (c) Schematic depiction of reasoning paths at different states. (d) Token density comparisons. On average, high-entropy tokens without hallucinations exhibit higher visual attention ratios than hallucinated ones.
  • Figure 4: Illustration of multimodal reasoning and entropy-aware decoding. The model receives both visual and textual tokens (left) and generates responses by integrating contextual information. During reasoning, token-level entropy $H_t$ measures model confidence and is compared with the reference entropy $\hat{H}$. High-entropy states (orange) trigger latent decoding, using probability-weighted embeddings to preserve semantic diversity, while low-entropy states (blue) activate discrete decoding, using sampled tokens for precise semantic convergence. This adaptive switching mechanism balances exploration and commitment in multimodal reasoning.
  • Figure 5: Comparison of average scores on the MMHalu and Bingo datasets under different entropy thresholds. $\Delta$ denotes the dynamic thresholding strategy in LEAD. A threshold of $\infty$ keeps the model in standard discrete CoT reasoning, while a threshold of 0 keeps it in latent reasoning.
  • ...and 5 more figures
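
The switching behaviour described in the captions for Figures 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `token_entropy` and `next_input_embedding`, the use of natural-log Shannon entropy, the `>=` comparison against the reference entropy, and the toy embedding table are all hypothetical. The dynamic threshold ($\Delta$) and the visual anchor injection are not modelled here.

```python
import numpy as np


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    p = np.asarray(probs, dtype=float)
    nonzero = p[p > 0]  # skip zero entries so log(0) is never evaluated
    return float(-(nonzero * np.log(nonzero)).sum())


def next_input_embedding(probs, embedding_table, ref_entropy, rng=None):
    """Choose the embedding fed back to the model at the next step.

    Assumption: entropy >= ref_entropy counts as a high-entropy state.
    Returns a (mode, embedding) pair, where mode is "latent" or "discrete".
    """
    p = np.asarray(probs, dtype=float)
    table = np.asarray(embedding_table, dtype=float)
    if token_entropy(p) >= ref_entropy:
        # Latent decoding: probability-weighted mix of all token embeddings.
        return "latent", p @ table
    # Discrete decoding: sample one token and use its embedding.
    rng = rng if rng is not None else np.random.default_rng(0)
    token_id = rng.choice(len(p), p=p)
    return "discrete", table[token_id]


# Toy example: 4-token vocabulary with one-hot "embeddings" (hypothetical).
toy_table = np.eye(4)
flat = [0.25, 0.25, 0.25, 0.25]    # uncertain step
peaked = [0.97, 0.01, 0.01, 0.01]  # confident step

for name, dist in [("flat", flat), ("peaked", peaked)]:
    mode, emb = next_input_embedding(dist, toy_table, ref_entropy=1.0)
    print(name, round(token_entropy(dist), 3), mode, emb)
```

With a fixed reference entropy, setting it to infinity never triggers the latent branch and setting it to 0 always does, which mirrors the two extremes described for Figure 5.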