CLUE: Crossmodal disambiguation via Language-vision Understanding with attEntion

Mouad Abrini, Mohamed Chetouani

TL;DR

CLUE introduces an ambiguity-aware Interactive Visual Grounding pipeline that converts cross-modal attention into a spatial signal to decide when to ask clarifying questions. The approach jointly trains an IVG decoder with LoRA adapters and an explicit ambiguity detector that operates on attention maps, enabling end-to-end disambiguation using InViG data and synthetic real-world data. Key results show a strong ambiguity detector (layer-14 signals) and CLUE-augmented IVG outperforming a state-of-the-art from-scratch baseline on InViG, with good generalization to real-world data. The work demonstrates that attention-based, spatial grounding signals provide interpretable, efficient cues for when to query and how to localize confusion, with practical impact for real-time human-robot interaction.

Abstract

With the increasing integration of robots into daily life, human-robot interaction has become more complex and multifaceted. A critical component of this interaction is Interactive Visual Grounding (IVG), through which robots must interpret human intentions and resolve ambiguity. Existing IVG models generally lack a mechanism to determine when to ask clarification questions, as they implicitly rely on their learned representations. CLUE addresses this gap by converting the VLM's cross-modal attention into an explicit, spatially grounded signal for deciding when to ask. We extract text-to-image attention maps and pass them to a lightweight CNN to detect referential ambiguity, while a LoRA-fine-tuned decoder conducts the dialog and emits grounding location tokens. We train on a real-world interactive dataset for IVG and on a mixed ambiguity set for the detector. With InViG-only supervision, our model surpasses a state-of-the-art method while using parameter-efficient fine-tuning. Similarly, the ambiguity detector outperforms prior baselines. Overall, CLUE turns the internal cross-modal attention of a VLM into an explicit, spatially grounded signal for deciding when to ask. The data and code are publicly available at: mouadabrini.github.io/clue
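
The ask-or-ground decision described in the abstract can be illustrated with a minimal control-loop sketch. This is not the authors' implementation: the function `clarify_then_ground`, the four callables, the `max_turns` cap, and the placeholder location string are all hypothetical names introduced here, and the toy stand-ins replace the real detector and decoder.

```python
from typing import Callable, List, Tuple

Context = List[str]  # running dialog: request, question, reply, ...


def clarify_then_ground(
    user_request: str,
    is_ambiguous: Callable[[Context], bool],  # stand-in for the ambiguity detector
    ask: Callable[[Context], str],            # stand-in for question generation
    reply: Callable[[str], str],              # stand-in for the human's answer
    ground: Callable[[Context], str],         # stand-in for location-token decoding
    max_turns: int = 3,                       # assumption: cap on clarification turns
) -> Tuple[str, Context]:
    """Ask follow-up questions while the detector flags ambiguity, then ground."""
    context: Context = [user_request]
    for _ in range(max_turns):
        if not is_ambiguous(context):
            break
        question = ask(context)
        # Append the question and the human reply to the running context.
        context.extend([question, reply(question)])
    return ground(context), context


# Toy stand-ins (hypothetical): the request counts as resolved once "red" appears.
def toy_is_ambiguous(ctx: Context) -> bool:
    return not any("red" in turn.lower() for turn in ctx)


def toy_ask(ctx: Context) -> str:
    return "Which apple do you mean?"


def toy_reply(question: str) -> str:
    return "The red one."


def toy_ground(ctx: Context) -> str:
    return "<loc0000>"  # placeholder string, not a real model output


box, history = clarify_then_ground(
    "Get the apple", toy_is_ambiguous, toy_ask, toy_reply, toy_ground
)
direct_box, direct_history = clarify_then_ground(
    "Get the red apple", toy_is_ambiguous, toy_ask, toy_reply, toy_ground
)
```

In this toy run the underspecified request triggers one clarification turn before grounding, while the already specific request is grounded immediately. The turn cap is a safety choice of this sketch; the source only says that the loop repeats until disambiguation.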

Paper Structure (25 sections, 9 equations, 10 figures, 3 tables, 1 algorithm)


Figures (10)

  • Figure 1: Problem illustration: when an instruction is underspecified, the robot should detect it and ask for clarification (AI-generated, then edited).
  • Figure 2: Overall CLUE architecture. An RGB image is encoded by SigLIP and projected by an MLP. The text prefix is tokenized and passed with the image tokens into a Gemma2 decoder equipped with LoRA adapters. The decoder both (i) autoregressively generates clarification questions and (ii) exposes cross-modal attention maps from mid-layers to a lightweight CNN ambiguity detector (Figure 4). If the detector predicts ambiguity, the model asks a follow-up question and updates the dialog context; otherwise it emits location tokens and triggers object detection. The loop repeats until disambiguation.
  • Figure 3: IVG with notation. An RGB image (AI-generated in this example) is encoded by SigLIP and projected to the decoder via an MLP; the text prefix is tokenized and concatenated with the image tokens. The Gemma2 LLM takes $X^{(1)} = \texttt{"<image> clarify "} \Vert C^{(1)}$, where the running context after the first turn is $C^{(1)} = U \Vert R_1 \Vert H_1$ (initial user request $U$, assistant question $R_1$, human reply $H_1$). The model autoregressively generates either a clarification segment (left stream, labeled $R_1$) or location tokens $G = \texttt{<locXXXX>}\ldots$ (right stream) that decode to a bounding box. Green blocks denote image features; pink blocks denote text tokens.
  • Figure 4: Ambiguity detector. The image is encoded by SigLIP and projected with an MLP. The text prefix (e.g. "Get the apple") is tokenized and passed to the Gemma2 decoder. From the 14th layer, we read the text-to-image cross-attention for the selected prefix query tokens $Q$ and keep the first 1024 keys (image tokens), yielding per-token $32\times 32$ maps that are mean-aggregated into a single spatial map. This map is passed to a lightweight CNN to produce the ambiguity probability $p_{\text{amb}}$.
  • Figure 5: Text$\rightarrow$image attention maps from layer 14 ($32\times 32$ patches) for different instructions. (a) Input image. (b) "detect the apple" yields two peaks over both apples (ambiguous). (c) "detect the red apple" concentrates on the left apple. (d) "detect apple on the right" concentrates on the right apple. Colors show min-max normalized attention weight.
  • ...and 5 more figures
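
The map construction described in the captions of Figures 4 and 5 (keep the first 1024 image-token keys, reshape each selected query row to $32\times 32$, mean-aggregate, then min-max normalize for display) can be sketched in NumPy as follows. This is an illustrative sketch, not the released code: the function names are hypothetical, the input is assumed to be a single-layer attention matrix already averaged over heads (the captions do not say how heads are combined), and random values stand in for real layer-14 attention. The CNN that maps the aggregated map to $p_{\text{amb}}$ is not sketched.

```python
import numpy as np

NUM_IMAGE_TOKENS = 1024  # image tokens come first in the sequence (per Figure 4)
GRID = 32                # 32 x 32 patch grid, since 32 * 32 = 1024


def aggregate_attention(attn: np.ndarray, query_idx: list) -> np.ndarray:
    """Mean-aggregate text-to-image attention into one 32 x 32 spatial map.

    attn: (seq_len, seq_len) attention weights from one layer,
          assumed here to be averaged over heads already.
    query_idx: positions of the selected prefix query tokens Q.
    """
    rows = attn[query_idx, :NUM_IMAGE_TOKENS]         # shape (|Q|, 1024)
    maps = rows.reshape(len(query_idx), GRID, GRID)   # per-token 32 x 32 maps
    return maps.mean(axis=0)                          # single spatial map


def min_max_normalize(m: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Rescale a map to roughly [0, 1] for visualization."""
    lo, hi = m.min(), m.max()
    return (m - lo) / (hi - lo + eps)


# Synthetic stand-in: 1024 image tokens followed by 4 text tokens.
rng = np.random.default_rng(0)
seq_len = NUM_IMAGE_TOKENS + 4
attn = rng.random((seq_len, seq_len))
query_idx = [1024, 1025, 1026]

spatial_map = aggregate_attention(attn, query_idx)
display_map = min_max_normalize(spatial_map)
```

Under these assumptions, `spatial_map` would be the input to the ambiguity detector and `display_map` corresponds to the color scale described for Figure 5.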