Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition

Jinlong Ma, Yu Zhang, Xuefeng Bai, Kehai Chen, Yuwei Wang, Zeming Liu, Jun Yu, Min Zhang

TL;DR

This work targets grounded multimodal named entity recognition (GMNER) and identifies modality bias in end-to-end MLLM-based approaches, where models rely on unimodal shortcuts rather than cross-modal verification. It introduces Modality-aware Consistency Reasoning (MCR), combining Multi-style Reasoning Schema Injection (MRSI) and Constraint-guided Verifiable Optimization (CVO) to enforce explicit, constraint-faithful cross-modal reasoning. Through experiments on GMNER, MNER-MI, and GREC, MCR consistently outperforms baselines, reduces visual and textual biases, and demonstrates improved grounding accuracy and reasoning reliability. The framework offers a principled, end-to-end alternative to cascaded pipelines, with practical implications for robust multimodal grounding and knowledge extraction.

Abstract

Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. In this work, we explore the potential of Multimodal Large Language Models (MLLMs) to perform GMNER in an end-to-end manner, moving beyond their typical role as auxiliary tools within cascaded pipelines. Crucially, our investigation reveals a fundamental challenge: MLLMs exhibit modality bias, including visual bias and textual bias, which stems from their tendency to take unimodal shortcuts rather than rigorous cross-modal verification. To address this, we propose Modality-aware Consistency Reasoning (MCR), which enforces structured cross-modal reasoning through Multi-style Reasoning Schema Injection (MRSI) and Constraint-guided Verifiable Optimization (CVO). MRSI transforms abstract constraints into executable reasoning chains, while CVO empowers the model to dynamically align its reasoning trajectories with Group Relative Policy Optimization (GRPO). Experiments on GMNER and visual grounding tasks demonstrate that MCR effectively mitigates modality bias and achieves superior performance compared to existing baselines.
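
As a rough illustration of the GRPO step mentioned in the abstract, the sketch below shows how a verifiable reward can be turned into group-relative advantages: several completions are sampled for the same input, each is scored, and each score is normalized against the mean and standard deviation of its own group. This is not the paper's implementation. The reward here (entity type must match and box IoU must reach 0.5) is a hypothetical stand-in for the paper's constraint-derived rewards, and the names `iou`, `toy_reward`, and `group_relative_advantages`, along with the use of the population standard deviation and the `eps` term, are assumptions made for this example.

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def toy_reward(pred, gold, iou_threshold=0.5):
    """Hypothetical verifiable reward (an assumption, not the paper's):
    1.0 if the entity type matches and the boxes overlap enough, else 0.0."""
    pred_type, pred_box = pred
    gold_type, gold_box = gold
    same_type = pred_type == gold_type
    return 1.0 if same_type and iou(pred_box, gold_box) >= iou_threshold else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).
    Population std and eps are choices made for this sketch."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one image-text pair, scored against one gold entity.
gold = ("PER", (10, 10, 50, 50))
samples = [
    ("PER", (12, 11, 49, 52)),    # right type, good box
    ("ORG", (12, 11, 49, 52)),    # wrong type
    ("PER", (60, 60, 90, 90)),    # right type, wrong region
    ("PER", (10, 10, 50, 50)),    # exact match
]
rewards = [toy_reward(s, gold) for s in samples]
advantages = group_relative_advantages(rewards)
```

Completions that satisfy the check receive a positive advantage and the others a negative one, so no separate value model is needed to obtain a baseline. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.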

Paper Structure

This paper contains 61 sections, 31 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Error patterns caused by modality bias in GMNER due to the model's tendency to hallucinate correlations based on unimodal heuristics rather than rigorous cross-modal verification.
  • Figure 2: The framework of MCR. The framework consists of two stages: (1) Multi-style Reasoning Schema Injection constructs diverse reasoning schemas $\mathcal{D}_{\mathcal{R}}$ by treating the core constraints as reasoning criteria and generating multiple reasoning styles from templates, LLMs, and MLLMs based on the image–text inputs and labels. A subset of $\mathcal{D}_{\mathcal{R}}$ is injected into MLLMs through supervised fine-tuning. (2) Constraint-guided Verifiable Optimization uses the remainder of $\mathcal{D}_{\mathcal{R}}$ and optimizes the model with verifiable reward functions derived from the core constraints, together with the GRPO algorithm, to enhance cross-modal consistency reasoning.
  • Figure 3: Quantitative results of textual bias. MCR effectively improves the models' ability to determine whether an entity is present, which in turn indicates that MCR mitigates textual bias.
  • Figure 4: Effect of Multi-style vs. Single-style Reasoning Schema on F1 and Reward Scores in CVO on Qwen2.5VL-7B (Left) and MimoVL-7B (Right). Single means single-style reasoning schema and Multi means multi-style reasoning schema.
  • Figure 5: Effect of Multi-style vs. Single-style Reasoning Schema on Cross Entropy (Left) and Mean Completion Length (Right) in CVO on MimoVL-7B.
  • ...and 3 more figures