When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding

Yan Shu, Hangui Lin, Yexin Liu, Yan Zhang, Gangyan Zeng, Yan Li, Yu Zhou, Ser-Nam Lim, Harry Yang, Nicu Sebe

TL;DR

The paper addresses semantic hallucination in large multimodal models when spotting and understanding scene text, where models rely on semantic priors rather than actual visual grounding. It analyzes layer-wise attention dynamics to reveal that stronger grounding in intermediate layers reduces hallucinations and introduces a training-free mitigation framework consisting of ZoomText for coarse-to-fine text localization and Grounded Layer Correction to fuse representations from the most grounded layer. A new benchmark, TextHalu-Bench, is proposed to rigorously evaluate non-semantic text grounding across spotting and understanding tasks. Empirical results show consistent improvements across multiple LMMs and standard OCR/VL benchmarks, demonstrating the method’s generalizability and potential for more reliable scene-text reasoning in real-world applications.

Abstract

Large Multimodal Models (LMMs) have achieved impressive progress in visual perception and reasoning. However, when confronted with visually ambiguous or non-semantic scene text, they often struggle to accurately spot and understand the content, frequently generating semantically plausible yet visually incorrect answers, which we refer to as semantic hallucination. In this work, we investigate the underlying causes of semantic hallucination and identify a key finding: Transformer layers in the LLM with stronger attention focus on scene text regions are less prone to producing semantic hallucinations. Thus, we propose a training-free semantic hallucination mitigation framework comprising two key components: (1) ZoomText, a coarse-to-fine strategy that identifies potential text regions without external detectors; and (2) Grounded Layer Correction, which adaptively leverages the internal representations from layers less prone to hallucination to guide decoding, correcting hallucinated outputs for non-semantic samples while preserving the semantics of meaningful ones. To enable rigorous evaluation, we introduce TextHalu-Bench, a benchmark of 1,740 samples spanning both semantic and non-semantic cases, with manually curated question-answer pairs designed to probe model hallucinations. Extensive experiments demonstrate that our method not only effectively mitigates semantic hallucination but also achieves strong performance on public benchmarks for scene text spotting and understanding.
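
The idea behind Grounded Layer Correction can be illustrated with a small sketch. This is not the authors' implementation: the function names, the scoring rule (share of attention mass on text-region tokens), and the fixed fusion weight `alpha` are all assumptions made for illustration. The text-region mask is assumed to come from a localization step such as ZoomText.

```python
import numpy as np

def text_attention_score(attn, text_mask):
    """Share of one layer's attention mass that falls on text-region image tokens.

    attn: (num_image_tokens,) non-negative attention weights for one layer.
    text_mask: (num_image_tokens,) boolean, True for tokens inside text regions.
    """
    total = attn.sum()
    return float(attn[text_mask].sum() / total) if total > 0 else 0.0

def select_grounded_layer(layer_attn, text_mask):
    """Return the index of the layer with the highest text-region attention share."""
    scores = [text_attention_score(a, text_mask) for a in layer_attn]
    return int(np.argmax(scores)), scores

def fuse_hidden_states(h_final, h_grounded, alpha=0.5):
    """Linear blend of final-layer and grounded-layer hidden states (hypothetical rule)."""
    return (1.0 - alpha) * h_final + alpha * h_grounded

# Toy example: 3 layers, 4 image tokens, tokens 1 and 2 cover the scene text.
layer_attn = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.40, 0.40, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
text_mask = np.array([False, True, True, False])

best_layer, scores = select_grounded_layer(layer_attn, text_mask)
fused = fuse_hidden_states(np.array([1.0, 0.0]), np.array([0.0, 1.0]), alpha=0.25)
```

In this toy case the middle layer is selected because it places the largest share of its attention on the text tokens. A real system would read attention maps and hidden states from the model itself and would likely choose the fusion weight adaptively, as the abstract suggests.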

Paper Structure

This paper contains 21 sections, 6 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: (a) LMMs hallucinate scene-text answers by relying on semantic priors rather than grounding in the actual visual content. For instance, when we edit "MOTEL" and "PULL" to "MMOTEL" and "PULLa", the models still answer the original ones. (b) and (c) illustrate the performance of LMMs on OCRBench and ICDAR 2015, with separate evaluations on semantic and non-semantic text samples.
  • Figure 2: Visualization of the hallucination-analysis pipeline. For each input image, we (1) identify hallucinated text tokens and compute their layer-wise hallucination-tendency scores, (2) calculate the ratio of the ground-truth text score to the hallucinated text score for each layer, and (3) overlay these normalized ratios onto the corresponding attention maps. We observe that layers with a lower propensity to hallucinate concentrate their attention more strongly on the text regions.
  • Figure 3: Visualization of the ZoomText process and examples.
  • Figure 4: (a) Examples of TextHalu-Bench. (b) Comparison of non-semantic answer ratios between existing scene text benchmarks and TextHalu-Bench. SQ, UQ, and ANS represent spotting questions, understanding questions, and answers, respectively.
  • Figure 5: Ablation on the Grounded Layer Correction. (Left) Different layer selection methods. (Right) Different correction strategies. "Base": Baseline; "Repla.": Replacement; "S-Repla.": Selective Replacement; "Fuse": Fusion.
  • ...and 4 more figures
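
The per-layer ratio described in the Figure 2 caption can be sketched as follows. This is only an illustration under stated assumptions: the caption does not say how the per-layer scores are obtained or how the ratios are normalized, so the inputs are taken as given, and the min-max normalization, the `eps` guard, and the function name `grounding_ratios` are hypothetical choices.

```python
import numpy as np

def grounding_ratios(gt_scores, halu_scores, eps=1e-8):
    """Per-layer ratio of ground-truth score to hallucinated score, scaled to [0, 1].

    gt_scores, halu_scores: one score per layer (assumed to be precomputed).
    eps guards against division by zero; min-max scaling is an assumed choice.
    """
    gt = np.asarray(gt_scores, dtype=float)
    halu = np.asarray(halu_scores, dtype=float)
    ratios = gt / (halu + eps)
    span = ratios.max() - ratios.min()
    if span == 0:
        return np.zeros_like(ratios)
    return (ratios - ratios.min()) / span

# Toy scores for three layers: the middle layer favors the ground-truth text most.
normalized = grounding_ratios([0.1, 0.6, 0.3], [0.4, 0.2, 0.3])
least_hallucinating_layer = int(np.argmax(normalized))
```

A higher normalized ratio marks a layer that favors the ground-truth text over the hallucinated text, which is the quantity the caption says is overlaid on the attention maps.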