Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding
Hao Guo, Jianfei Zhu, Wei Fan, Chunzhi Yi, Feng Jiang
TL;DR
This work tackles visual grounding in natural human-robot interaction (HRI) by moving beyond object-category descriptions to multi-attribute references that include user states, derived intentions, and embodied gestures. It introduces Multi-ref EC, a task framework for grounding from state, intention, and gesture cues, and SIGAR, a new dataset with free-form state, intention, and embodied reference annotations built on YouRefIt. The authors establish baselines using end-to-end REC models and model combinations with multimodal LLMs, and run ablations showing how attribute type, attribute pairing, and prompt ordering affect localization. The findings indicate that integrating multi-attribute references is necessary for robust visual-language grounding in real-world HRI, and they position SIGAR as a benchmark for advancing multimodal reasoning in grounding tasks.
Abstract
Referring expression comprehension (REC) aims to localize objects from natural language descriptions. However, existing REC approaches are limited to object-category descriptions and single-attribute intention descriptions, which hinders their application in real-world scenarios. In natural human-robot interaction, users often express their desires through their own states and intentions, accompanied by guiding gestures, rather than through detailed object descriptions. To address this challenge, we propose Multi-ref EC, a novel task framework that integrates state descriptions, derived intentions, and embodied gestures to locate target objects. We introduce the State-Intention-Gesture Attributes Reference (SIGAR) dataset, which combines state and intention expressions with embodied references. Through extensive experiments with various baseline models on SIGAR, we demonstrate that properly ordered multi-attribute references improve localization performance and that a single-attribute reference is insufficient for natural human-robot interaction scenarios. Our findings underscore the importance of multi-attribute reference expressions in advancing visual-language understanding.
