Uncovering Grounding IDs: How External Cues Shape Multimodal Binding

Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian, Mohammad Izadi, Mahdieh Soleymani Baghshah

TL;DR

This work proposes Grounding IDs, latent symbolic identifiers induced by simple external visual cues that bind objects to their corresponding partitions across image and text in LVLMs. Through observational probes, causal activation-swapping, and layerwise analyses, the paper demonstrates that Grounding IDs strengthen partition-specific cross-modal binding, reduce the modality gap, and mediate object–cue associations. Empirically, Grounding IDs improve performance on visual reasoning tasks and substantially mitigate hallucinations during long-form generation, in both open and closed models. The findings offer interpretability into LVLM reasoning and suggest a practical, model-agnostic scaffolding approach that enhances grounding without additional inference modules.

Abstract

Large vision-language models (LVLMs) show strong performance across multimodal benchmarks but remain limited in structured reasoning and precise grounding. Recent work has demonstrated that adding simple visual structures, such as partitions and annotations, improves accuracy, yet the internal mechanisms underlying these gains remain unclear. We investigate this phenomenon and propose the concept of Grounding IDs, latent identifiers induced by external cues that bind objects to their designated partitions across modalities. Through representation analysis, we find that these identifiers emerge as consistent within-partition alignment in embedding space and reduce the modality gap between image and text. Causal interventions further confirm that these identifiers mediate binding between objects and symbolic cues. We show that Grounding IDs strengthen attention between related components, which in turn improves cross-modal grounding and reduces hallucinations. Taken together, our results identify Grounding IDs as a key symbolic mechanism that explains how external cues enhance multimodal binding and offer both interpretability and practical improvements.

Paper Structure

This paper contains 50 sections, 9 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: Conceptual overview of Grounding IDs. An input image is augmented with simple visual cues (e.g., {@, #, $}) and paired with a prompt that explicitly includes these symbols. Embeddings with the same Grounding IDs are displayed in matching colors across modalities, illustrating the reinforced binding between partitions and their corresponding textual descriptions.
  • Figure 2: Illustration of attention patterns under baseline and structured inputs in the scene description task. (a) One dataset sample (top: baseline, bottom: structured). (b) Within-modality visual attention matrices. (c) Cross-modality attention matrices.
  • Figure 3: Analysis of the modality gap. (a) Cross-modal alignment across layers, showing that improvements emerge in layers 22–27. (b) Average alignment in layers 22–27, reported separately for four partitions. Object embeddings under structured inputs achieve higher alignment than the baseline (dashed line), and symbol embeddings achieve even stronger alignment than objects.
  • Figure 4: Activation swap experiment. (a) Procedure in a case where source ($c'$) and target ($c$) contain the same objects. Activations from the & and @ partitions of $c'$ are patched into $c$, producing the patched context $c^*$. Predictions in $c^*$ follow the transferred bindings (gray) rather than host symbols. (b) Average log probabilities of $c$ and $c^*$ over valid row–symbol–object combinations. Rows and columns indicate the two selected query symbols and their corresponding objects.
  • Figure 5: Causal mediation analysis across layers. (a) Average logit differences between $\textbf{o}^{s}_{\sim s}$ and $\textbf{o}^{\sim s}_{s}$ across layers, showing where the model begins to favor the grounded object. (b) Signal-to-noise scores for attention differences between $\textbf{o}^{s}_{\sim s}$ and $\textbf{o}^{\sim s}_{s}$ patches across heads and layers.
  • ...and 12 more figures
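
The per-partition alignment described in the Figure 3 caption can be illustrated with a small sketch. This section does not state the paper's exact metric, so the code below assumes one plausible choice: the cosine similarity between the mean image embedding and the mean text embedding of each partition. The function names, the dictionary layout keyed by cue symbol, and the toy 2-D vectors are all hypothetical and serve only as illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def partition_alignment(
    image_embs: Dict[str, List[Vector]],
    text_embs: Dict[str, List[Vector]],
) -> Dict[str, float]:
    """Assumed metric: cosine between mean image and mean text embedding
    for every partition symbol present in both modalities."""
    return {
        symbol: cosine(mean_vector(image_embs[symbol]), mean_vector(text_embs[symbol]))
        for symbol in image_embs
        if symbol in text_embs
    }


# Hypothetical toy embeddings keyed by cue symbol (not data from the paper).
toy_image = {"@": [[1.0, 0.0], [1.0, 0.0]], "#": [[0.0, 2.0]]}
toy_text = {"@": [[2.0, 0.0]], "#": [[1.0, 0.0]]}

scores = partition_alignment(toy_image, toy_text)
for symbol, score in scores.items():
    print(f"partition {symbol}: alignment = {score:.3f}")
```

In this toy setup the `@` partition is perfectly aligned across modalities and the `#` partition is orthogonal. With real hidden states, one would compute the same quantity per layer for baseline and structured inputs and compare the two curves.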