Uncovering Grounding IDs: How External Cues Shape Multimodal Binding

Hosein Hasani; Amirmohammad Izadi; Fatemeh Askari; Mobin Bagherian; Sadegh Mohammadian; Mohammad Izadi; Mahdieh Soleymani Baghshah

TL;DR

This work proposes Grounding IDs: latent symbolic identifiers, induced by simple external visual cues, that bind objects to their corresponding partitions across image and text in large vision-language models (LVLMs). Through observational probes, causal activation-swapping, and layerwise analyses, the paper demonstrates that Grounding IDs strengthen partition-specific cross-modal binding, reduce the modality gap, and mediate object–cue associations. Empirically, Grounding IDs improve performance on visual reasoning tasks and substantially mitigate hallucinations during long-form generation, with applicability to both open and closed models. The findings offer insight into LVLM reasoning and suggest a practical, model-agnostic scaffolding approach that enhances grounding without additional inference modules.

Abstract

Large vision-language models (LVLMs) show strong performance across multimodal benchmarks but remain limited in structured reasoning and precise grounding. Recent work has demonstrated that adding simple visual structures, such as partitions and annotations, improves accuracy, yet the internal mechanisms underlying these gains remain unclear. We investigate this phenomenon and propose the concept of Grounding IDs, latent identifiers induced by external cues that bind objects to their designated partitions across modalities. Through representation analysis, we find that these identifiers emerge as consistent within-partition alignment in embedding space and reduce the modality gap between image and text. Causal interventions further confirm that these identifiers mediate binding between objects and symbolic cues. We show that Grounding IDs strengthen attention between related components, which in turn improves cross-modal grounding and reduces hallucinations. Taken together, our results identify Grounding IDs as a key symbolic mechanism that explains how external cues enhance multimodal binding and offer both interpretability and practical improvements.
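
To make the "modality gap" mentioned above concrete, the following is a minimal sketch of one common way to quantify it: the Euclidean distance between the centroids of L2-normalized image and text embeddings. This is an illustrative assumption, not necessarily the metric used in the paper; the function names and the toy two-dimensional vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def modality_gap(image_embs, text_embs):
    """Distance between the centroids of normalized image and text embeddings.

    Assumption: this centroid-distance definition is one common choice;
    the paper may measure the gap differently.
    """
    img_c = centroid([l2_normalize(v) for v in image_embs])
    txt_c = centroid([l2_normalize(v) for v in text_embs])
    return math.dist(img_c, txt_c)


# Toy, made-up embeddings: the two modalities occupy orthogonal directions.
image_embs = [[1.0, 0.0], [2.0, 0.0]]
text_embs = [[0.0, 1.0], [0.0, 3.0]]
gap = modality_gap(image_embs, text_embs)
```

Under this definition, a smaller value means that image and text representations sit closer together in the shared space, which is the direction of change the abstract attributes to Grounding IDs.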
