SchröMind: Mitigating Hallucinations in Multimodal Large Language Models via Solving the Schrödinger Bridge Problem

Ziqiang Shi; Rujie Liu; Shanshan Yu; Satoshi Munakata; Koichi Shirahata

TL;DR

SchröMind tackles hallucinations in multimodal large language models by learning a token-level activation correction that maps hallucinatory attention activations to truthful ones by solving the Schrödinger bridge problem (SBP) with entropy-regularized optimal transport (EOT). The method identifies influential attention heads and applies either static or dynamic SBP-driven corrections, implemented via a Gaussian-mixture potential and a lightweight training regime. Empirical results on POPE and MME demonstrate state-of-the-art hallucination reduction with minimal computational overhead, while preserving the models' multimodal capabilities. This approach offers a practical, data-efficient path to safer, more reliable vision-language systems for high-stakes applications.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have achieved significant success across various domains. However, their use in high-stakes fields like healthcare remains limited due to persistent hallucinations, where generated text contradicts or ignores visual input. We contend that MLLMs can comprehend images but struggle to produce accurate token sequences. Minor perturbations can shift attention from truthful to untruthful states, and the autoregressive nature of text generation often prevents error correction. To address this, we propose SchröMind, a novel framework that reduces hallucinations by solving the Schrödinger bridge problem. It establishes a token-level mapping between hallucinatory and truthful activations with minimal transport cost through lightweight training, while preserving the model's original capabilities. Extensive experiments on the POPE and MME benchmarks demonstrate the superiority of SchröMind, which achieves state-of-the-art performance while introducing only minimal computational overhead.
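
To make the idea of a minimal-cost mapping between hallucinatory and truthful activations concrete, the following is a minimal sketch of discrete entropy-regularized optimal transport using Sinkhorn iterations. It is not the authors' solver (the TL;DR mentions a Gaussian-mixture potential; plain Sinkhorn is swapped in here for brevity). The synthetic activations, the function names `sinkhorn_plan` and `barycentric_map`, and the values of `eps` and `n_iters` are all assumptions made for illustration.

```python
import numpy as np

def sinkhorn_plan(x, y, eps=1.0, n_iters=200):
    """Entropic OT plan between two point clouds with uniform weights."""
    n, m = x.shape[0], y.shape[0]
    # Half squared Euclidean cost between every source/target pair, shape (n, m).
    cost = 0.5 * ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-cost / eps)      # Gibbs kernel
    a = np.full(n, 1.0 / n)      # uniform source marginal
    b = np.full(m, 1.0 / m)      # uniform target marginal
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)          # rescale rows toward marginal a
        v = b / (K.T @ u)        # rescale columns to match marginal b
    return u[:, None] * K * v[None, :]

def barycentric_map(plan, y):
    """Send each source point to the plan-weighted average of the targets."""
    return (plan @ y) / plan.sum(axis=1, keepdims=True)

# Synthetic stand-ins for attention-head activations (hypothetical data).
rng = np.random.default_rng(0)
halluc = rng.normal(loc=0.0, scale=1.0, size=(64, 8))
truth = rng.normal(loc=1.0, scale=1.0, size=(64, 8))

plan = sinkhorn_plan(halluc, truth)
corrected = barycentric_map(plan, truth)
```

Here each row of `corrected` is a convex combination of the truthful samples, so the correction moves a hallucinatory activation toward the truthful set. A larger `eps` spreads mass more diffusely across targets, while a smaller `eps` approaches a hard matching.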

Paper Structure (11 sections, 11 equations, 3 figures, 2 tables)

Introduction
SchröMind
Preliminaries and Notation
Defining Hallucination Mitigation in SchröMind
Hallucination-to-Truth Manifold Mapping
Hallucination Mitigation Pipeline for MLLMs
Experiments
Benchmarks and Evaluation Protocol
Results and discussion
Ablation study
Conclusion

Figures (3)

Figure 1: Schematic of SchröMind. MLLMs process correct and hallucinated responses to extract attention activations. Classifiers then identify critical attention heads and analyze token-wise influences. SBP/EOT learns a distribution mapping to mitigate hallucinations.
Figure 2: t-SNE visualizations comparing the hallucination and reliable manifolds in LLaVA-1.5-7B and Qwen2.5-VL-7B at the image and object levels, with SBP trajectories illustrating the transformations between manifolds.
Figure 3: SchröMind outperforms prior SOTA models (ICT, regular LLaVA-1.5 and Qwen2.5-VL) on MME, with gains in key areas: existence, position, counting, color perception, commonsense reasoning, and overall performance.
