Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding

Hao Guo, Jianfei Zhu, Wei Fan, Chunzhi Yi, Feng Jiang

TL;DR

This work tackles visual grounding under natural human-robot interaction by moving beyond object-category descriptions to multi-attribute references that include user states, derived intentions, and embodied gestures. It introduces Multi-ref EC, a framework for grounding based on state-intention-gesture cues, and SIGAR, a novel dataset with free-form state, intention, and embodied reference annotations built on YouRefIt. The authors establish strong baselines using end-to-end REC models and model-combination with multimodal LLMs, and perform extensive ablations to reveal how attribute type, attribute pairing, and prompt ordering affect localization. The findings demonstrate the necessity of integrating multi-attribute references for robust visual-language grounding in real-world HRI and position SIGAR as a valuable benchmark for advancing multimodal reasoning in grounding tasks.

Abstract

Referring expression comprehension (REC) aims at achieving object localization based on natural language descriptions. However, existing REC approaches are constrained by object category descriptions and single-attribute intention descriptions, hindering their application in real-world scenarios. In natural human-robot interactions, users often express their desires through individual states and intentions, accompanied by guiding gestures, rather than detailed object descriptions. To address this challenge, we propose Multi-ref EC, a novel task framework that integrates state descriptions, derived intentions, and embodied gestures to locate target objects. We introduce the State-Intention-Gesture Attributes Reference (SIGAR) dataset, which combines state and intention expressions with embodied references. Through extensive experiments with various baseline models on SIGAR, we demonstrate that properly ordered multi-attribute references contribute to improved localization performance, revealing that single-attribute reference is insufficient for natural human-robot interaction scenarios. Our findings underscore the importance of multi-attribute reference expressions in advancing visual-language understanding.
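
The abstract's claim about "properly ordered" multi-attribute references can be made concrete with a small sketch. The snippet below is not the authors' pipeline: the example texts, the `build_prompts` helper, and the plain space-joined prompt format are all assumptions for illustration. The gesture cue is also stood in for by a text string here, whereas in the paper it is an embodied reference. The sketch only shows how the three attributes could be enumerated in every ordering before being passed to a grounding model.

```python
from itertools import permutations

# Hypothetical reference; the texts are illustrative and not taken from SIGAR.
reference = {
    "state": "I am thirsty.",
    "intention": "I want something to drink.",
    "gesture": "The person is pointing toward the table.",
}


def build_prompts(ref, keys=("state", "intention", "gesture")):
    """Return one prompt per ordering of the given attribute keys.

    The space-joined format is an assumption; the source does not
    specify how attributes are concatenated.
    """
    prompts = {}
    for order in permutations(keys):
        prompts[order] = " ".join(ref[k] for k in order)
    return prompts


prompts = build_prompts(reference)
for order, text in prompts.items():
    print(" > ".join(order), "|", text)
```

Passing a subset of keys, for example `build_prompts(reference, keys=("state", "intention"))`, gives the attribute-pairing variants in the same way.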

Paper Structure

This paper contains 22 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Comparison of visual grounding paradigms: Affordance Detection (AD), Task-Driven Object Detection (TDOD), Referring Expression Comprehension (REC), Intention-Oriented Object Detection (IOOD), Intention-driven VG (IVG), and our proposed Multi-ref EC. Red, green, and blue text in Multi-ref EC represent state, intention, and gesture references, respectively.
  • Figure 2: Illustration of the data collection for Multi-ref EC. The process begins with inheriting YouRefIt data, followed by generating state-intention drafts using Claude-3.5-sonnet with vision-language input. We then manually filter for well-matched expressions and apply data augmentation to create semantically equivalent variations.
  • Figure 3: SIGAR dataset statistics. Word clouds of (a) state expressions, (b) intention expressions, and (c) object categories.
  • Figure 4: State word clouds for selected categories from the SIGAR dataset: (a) bag, (b) phone, (c) remote.
  • Figure 5: Intention word clouds for selected categories from the SIGAR dataset: (a) bag, (b) phone, (c) remote.
  • ...and 1 more figure