AD-DINO: Attention-Dynamic DINO for Distance-Aware Embodied Reference Understanding

Hao Guo, Wei Fan, Baichun Wei, Jianfei Zhu, Jin Tian, Chunzhi Yi, Feng Jiang

TL;DR

Attention-Dynamic DINO (AD-DINO) is a framework designed to mitigate misinterpretations of pointing gestures across interaction contexts. It achieves 76.4% accuracy at the 0.25 IoU threshold and surpasses human performance at the 0.75 IoU threshold, a first in this domain.

Abstract

Embodied reference understanding is crucial for intelligent agents to predict referents based on human intention through gesture signals and language descriptions. This paper introduces the Attention-Dynamic DINO, a novel framework designed to mitigate misinterpretations of pointing gestures across various interaction contexts. Our approach integrates visual and textual features to simultaneously predict the target object's bounding box and the attention source in pointing gestures. Leveraging the distance-aware nature of nonverbal communication in visual perspective taking, we extend the virtual touch line mechanism and propose an attention-dynamic touch line that represents referring gestures according to the interaction distance. Combining this distance-aware approach with independent prediction of the attention source improves the alignment between objects and the line that represents the gesture. Extensive experiments on the YouRefIt dataset demonstrate that our gesture understanding method significantly improves task performance. Our model achieves 76.4% accuracy at the 0.25 IoU threshold and, notably, surpasses human performance at the 0.75 IoU threshold, marking a first in this domain. Comparative experiments with distance-unaware understanding methods from previous research further validate the superiority of the Attention-Dynamic Touch Line across diverse contexts.
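The accuracy figures above are reported at fixed IoU thresholds. As a minimal sketch of how such a metric is typically computed (not the authors' evaluation code), the snippet below counts a prediction as correct when its intersection-over-union with the ground-truth box reaches the threshold. The `(x1, y1, x2, y2)` box format, the function names `iou` and `accuracy_at_iou`, and the use of `>=` rather than `>` are assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the ground truth is >= threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


# Illustrative boxes only; these are not results from the paper.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 0, 15, 10)]
for t in (0.25, 0.5, 0.75):
    print(t, accuracy_at_iou(preds, gts, t))
```

The second pair of boxes overlaps with an IoU of 1/3, so it counts as correct at the 0.25 threshold but not at 0.5 or 0.75, which shows why the stricter thresholds are the harder test of localization.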

Paper Structure

This paper contains 23 sections, 5 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Example of pointing at an object close to the body. In this scenario, the girl accurately points at the lipstick near her, while her eyes deviate from the line connecting the fingertip and the lipstick.
  • Figure 2: Language and visual inputs are first encoded into initial language and visual features, respectively. The two feature sets are then enhanced by cross-modality fusion. Next, the language-guided query selection module selects cross-modal queries from the visual features. Finally, a cross-modality decoder predicts the object boxes and the attention source. The attention source and the fingertip predicted by the fingertip detector jointly define the predicted attention-dynamic touch line (ADTL).
  • Figure 3: A single layer of the cross-modality fusion module.
  • Figure 4: A single layer of the cross-modality decoder module.
  • Figure 5: Demonstration of the distance-aware character of the ADTL. Compared with distance-unaware VPT methods, the ADTL aligns more accurately with the referent because it accounts for interaction distance. The ADTLs in (a, d, e) take the form of FL, while those in (b, c, f) take the form of VTL. Yellow and red rectangles represent the ground-truth and ADTL-predicted bounding boxes, respectively. Yellow arrows indicate the ADTLs, and blue lines represent the interpretation mode abandoned by the ADTL. The green text on the first line is the natural language input, and the text on the second line shows the GIoU value between the ground-truth and predicted bounding boxes.
  • ...and 2 more figures
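
The Figure 5 caption reports a GIoU value between the ground-truth and predicted boxes. As a hedged sketch of the standard generalized IoU definition (IoU minus the fraction of the smallest enclosing box not covered by the union), the snippet below shows how that value can be computed. The `(x1, y1, x2, y2)` box format and the function name `giou` are assumptions for illustration, and this is not taken from the authors' code.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def giou(a: Box, b: Box) -> float:
    """Generalized IoU: IoU - (enclosing_area - union) / enclosing_area."""
    # Overlap rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Smallest axis-aligned box enclosing both inputs.
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    enclosing = (ex2 - ex1) * (ey2 - ey1)

    if union <= 0 or enclosing <= 0:
        return 0.0  # degenerate (zero-area) boxes; a convention chosen for this sketch
    return inter / union - (enclosing - union) / enclosing


# Illustrative boxes only.
print(giou((0, 0, 10, 10), (0, 0, 10, 10)))  # identical boxes
print(giou((0, 0, 1, 1), (2, 0, 3, 1)))      # disjoint boxes give a negative value
```

Unlike plain IoU, which is zero for every pair of non-overlapping boxes, GIoU ranges over (-1, 1] and becomes more negative as the boxes move apart, so it still indicates how far a missed prediction is from the target.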