Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding
Atharv Mahesh Mane, Dulanga Weerakoon, Vigneshwaran Subbaraju, Sougata Sen, Sanjay E. Sarma, Archan Misra
TL;DR
This work tackles 3D-Embodied Reference Understanding (3D-ERU), where a target object must be identified from a combination of natural language, a pointing gesture, and a 3D scene. It introduces Imputer, an automated data augmentation framework, and ImputeRefer, a gesture-rich benchmark derived from existing 3D-REC data. The core contribution is Ges3ViG, a unified model that localizes the human, interprets the pointing gesture, and fuses gestural and linguistic cues through a multi-stage fusion pipeline, achieving substantial gains over gesture-free and prior gesture-based methods. The results demonstrate the practical value of jointly modeling human localization with referential grounding, and the authors release both the dataset and code to advance 3D-ERU research.
Abstract
3-Dimensional Embodied Reference Understanding (3D-ERU) combines a language description and an accompanying pointing gesture to identify the most relevant target object in a 3D scene. Although prior work has explored purely language-based 3D grounding, 3D-ERU, which also incorporates human pointing gestures, has received limited attention. To address this gap, we introduce a data augmentation framework, Imputer, and use it to curate a new benchmark dataset for 3D-ERU, ImputeRefer, by incorporating human pointing gestures into existing 3D scene datasets that contain only language instructions. We also propose Ges3ViG, a novel model for 3D-ERU that achieves ~30% improvement in accuracy compared to other 3D-ERU models and ~9% compared to other purely language-based 3D grounding models. Our code and dataset are available at https://github.com/AtharvMane/Ges3ViG.
