CLUE: Crossmodal disambiguation via Language-vision Understanding with attEntion
Mouad Abrini, Mohamed Chetouani
TL;DR
CLUE introduces an ambiguity-aware Interactive Visual Grounding (IVG) pipeline that converts cross-modal attention into a spatial signal for deciding when to ask clarifying questions. The approach jointly trains an IVG decoder with LoRA adapters and an explicit ambiguity detector that operates on attention maps, enabling end-to-end disambiguation using InViG data together with synthetic and real-world data. The ambiguity detector performs strongly when fed layer-14 attention signals, and the CLUE-augmented IVG model outperforms a state-of-the-art from-scratch baseline on InViG while generalizing well to real-world data. The work demonstrates that attention-based, spatially grounded signals provide interpretable, efficient cues for when to query and where the confusion lies, with practical impact for real-time human-robot interaction.
Abstract
With the increasing integration of robots into daily life, human-robot interaction has become more complex and multifaceted. A critical component of this interaction is Interactive Visual Grounding (IVG), through which robots must interpret human intentions and resolve ambiguity. Existing IVG models generally lack an explicit mechanism for determining when to ask clarification questions; instead, they rely implicitly on their learned representations. CLUE addresses this gap by converting the VLM's cross-modal attention into an explicit, spatially grounded signal for deciding when to ask. We extract text-to-image attention maps and pass them to a lightweight CNN to detect referential ambiguity, while a LoRA fine-tuned decoder conducts the dialog and emits grounding location tokens. We train on a real-world interactive dataset for IVG and on a mixed ambiguity set for the detector. With InViG-only supervision, our model surpasses a state-of-the-art method while using parameter-efficient fine-tuning. Similarly, the ambiguity detector outperforms prior baselines. Overall, CLUE shows that the internal cross-modal attention of a VLM can serve as an explicit, spatially grounded cue for when a robot should ask. The data and code are publicly available at: mouadabrini.github.io/clue
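As a rough illustration of the idea of turning text-to-image attention into a spatial ambiguity signal, the following minimal Python sketch averages one layer's attention from text tokens to image patches, reshapes it into a 2D map, and scores it with a single untrained convolution followed by pooling and a sigmoid. This is not the authors' implementation: the function names, the token layout, the 4x4 patch grid, and the hand-set kernel are all hypothetical, and the single convolution is only a stand-in for the lightweight CNN, which in practice would be trained on labeled ambiguity data.

```python
import math

def text_to_image_map(attn, text_idx, image_idx, grid_w):
    """Average attention from text query tokens to image key tokens.

    attn is assumed to be one layer's weights, indexed [head][query][key].
    Returns a 2D grid (list of rows) normalized to sum to 1.
    """
    n_heads = len(attn)
    flat = []
    for k in image_idx:
        total = sum(attn[h][q][k] for h in range(n_heads) for q in text_idx)
        flat.append(total / (n_heads * len(text_idx)))
    norm = sum(flat) or 1.0
    flat = [v / norm for v in flat]
    return [flat[i:i + grid_w] for i in range(0, len(flat), grid_w)]

def conv2d_valid(grid, kernel):
    """Plain 2D cross-correlation with no padding and stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(grid) - kh + 1
    out_w = len(grid[0]) - kw + 1
    return [
        [
            sum(grid[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def ambiguity_score(grid, kernel, bias=0.0):
    """Conv -> ReLU -> global average pool -> sigmoid (untrained stand-in)."""
    feat = conv2d_valid(grid, kernel)
    relu = [max(0.0, v) for row in feat for v in row]
    pooled = sum(relu) / len(relu)
    return 1.0 / (1.0 + math.exp(-(pooled + bias)))

# Toy input (hypothetical layout): 16 image tokens (4x4 grid), then 2 text tokens.
n_tokens = 18
attn = [[[0.0] * n_tokens for _ in range(n_tokens)]]  # a single head
for q in (16, 17):
    for k in range(16):
        attn[0][q][k] = 0.01
    # Two separated peaks, as might occur when two objects match the phrase.
    attn[0][q][5] = 0.4
    attn[0][q][10] = 0.4

grid = text_to_image_map(attn, text_idx=[16, 17], image_idx=list(range(16)), grid_w=4)
kernel = [[1.0] * 3 for _ in range(3)]  # hand-set weights, not learned
score = ambiguity_score(grid, kernel)
```

With trained weights, a score above a chosen threshold would trigger a clarifying question, and the attention map itself indicates which image regions compete for the referring expression.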
