A Multimodal Depth-Aware Method For Embodied Reference Understanding

Fevziye Irem Eyiokur, Dogucan Yaman, Hazım Kemal Ekenel, Alexander Waibel

TL;DR

This work tackles Embodied Reference Understanding (ERU), where identifying a referent object requires integrating language and pointing gestures in cluttered scenes. It introduces a depth-aware, dual-model framework: M_aug trained with LLM-based text augmentation to improve linguistic robustness, and M_depth that uses depth maps as an additional modality, both fused in a transformer-based backbone. A novel Depth-Aware Decision Module (DADM) then combines top predictions from both models by leveraging the predicted pointing line to select the most plausible referent, with OpenPose used to estimate eye and finger coordinates. The approach achieves state-of-the-art results on YouRefIt and strong performance on the ISL pointing dataset, with ablations showing that the combination of augmentation, depth, and distance-to-line decision yields robust disambiguation in challenging scenes, significantly advancing practical ERU for human–robot interaction.
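The distance-to-line idea mentioned above can be illustrated with a minimal sketch. This is not the paper's DADM: it assumes 2D image coordinates only, ignores depth, treats the pointing line as the infinite line through the eye and fingertip keypoints, and scores each candidate box by the perpendicular distance from its center to that line. The function names (`point_to_line_distance`, `box_center`, `select_referent`) and the `(x1, y1, x2, y2)` box format are hypothetical choices made for illustration.

```python
import math


def point_to_line_distance(point, eye, finger):
    """Perpendicular distance from `point` to the line through `eye` and `finger`."""
    (px, py), (ex, ey), (fx, fy) = point, eye, finger
    dx, dy = fx - ex, fy - ey
    norm = math.hypot(dx, dy)
    if norm == 0:
        raise ValueError("eye and finger must be distinct points")
    # |cross product| of the line direction and the eye-to-point vector,
    # divided by the length of the line direction.
    return abs(dx * (py - ey) - dy * (px - ex)) / norm


def box_center(box):
    """Center of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def select_referent(candidates, eye, finger):
    """Return the candidate box whose center lies closest to the pointing line."""
    return min(
        candidates,
        key=lambda box: point_to_line_distance(box_center(box), eye, finger),
    )


# Hypothetical keypoints and candidate boxes, e.g. top predictions of two models.
eye = (0.0, 0.0)
finger = (1.0, 1.0)
candidates = [(8.0, 0.0, 12.0, 4.0), (8.0, 8.0, 12.0, 12.0)]
best = select_referent(candidates, eye, finger)
```

In this toy setup the second box is selected because its center sits on the line through the eye and fingertip. A fuller version would also need to account for depth and for which side of the fingertip a candidate lies on.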

Abstract

Embodied Reference Understanding requires identifying a target object in a visual scene based on both language instructions and pointing cues. While prior works have shown progress in open-vocabulary object detection, they often fail in ambiguous scenarios where multiple candidate objects exist in the scene. To address these challenges, we propose a novel ERU framework that jointly leverages LLM-based data augmentation, depth-map modality, and a depth-aware decision module. This design enables robust integration of linguistic and embodied cues, improving disambiguation in complex or cluttered environments. Experimental results on two datasets demonstrate that our approach significantly outperforms existing baselines, achieving more accurate and reliable referent detection.

Paper Structure

This paper contains 6 sections, 1 equation, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: Overall framework. Depth and LLM Augmentation modules are used interchangeably with proposed parallel models.
  • Figure 2: Visual comparisons on YouRefIt dataset (first two rows) and ISL pointing dataset (last two rows).