
Beyond Language: Grounding Referring Expressions with Hand Pointing in Egocentric Vision

Ling Li, Bowen Liu, Zinuo Zhan, Peng Jie, Jianhui Zhong, Kenglun Chang, Zhidong Deng

Abstract

Traditional Visual Grounding (VG) predominantly relies on textual descriptions to localize objects, a paradigm that inherently struggles with linguistic ambiguity and often ignores the non-verbal deictic cues prevalent in real-world interactions. In natural egocentric interaction, hand pointing combined with speech is the most intuitive referring mechanism. To bridge this gap, we introduce EgoPoint-Ground, the first large-scale multimodal dataset dedicated to egocentric deictic visual grounding. Comprising over 15k interactive samples in complex scenes, the dataset provides rich, multi-grained annotations, including hand-target bounding box pairs and dense semantic captions. We establish a comprehensive benchmark for hand-pointing referring expression resolution, evaluating a wide spectrum of mainstream Multimodal Large Language Models (MLLMs) and state-of-the-art VG architectures. Furthermore, we propose SV-CoT, a novel baseline framework that reformulates grounding as a structured inference process, synergizing gestural and linguistic cues through a Visual Chain-of-Thought paradigm. Extensive experiments demonstrate that SV-CoT achieves an 11.7% absolute improvement over existing methods, effectively mitigating semantic ambiguity and advancing the ability of agents to comprehend multimodal physical intent. The dataset and code will be made publicly available.
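
To make the "multi-grained annotations" concrete, the minimal Python sketch below shows what a single sample might look like. The paper does not specify a serialization format here, so every field name, path, and value is an illustrative placeholder rather than the dataset's actual schema.

```python
# Hypothetical single EgoPoint-Ground sample. All field names, paths, and
# values below are illustrative assumptions, not the dataset's real schema.
sample = {
    "image": "frames/kitchen_0001.jpg",     # egocentric RGB frame (made-up path)
    "hand_bbox": [412, 388, 655, 602],      # pointing hand, [x1, y1, x2, y2] in pixels
    "target_bbox": [120, 240, 205, 330],    # referred target object, same convention
    "caption": "the red mug on the lower shelf, left of the kettle",
    "qa_pairs": [                           # dense Visual QA annotations
        {"q": "What object is the hand pointing at?",
         "a": "a red ceramic mug on the lower shelf"},
    ],
    "hand_keypoints": [[530.0, 455.0], [498.0, 402.0]],  # 2D pose points, e.g. wrist and index tip
}
```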


Paper Structure

This paper contains 39 sections, 4 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Our dataset features images captured from an egocentric perspective in which a hand points toward a target object. Each image carries four core annotations: Hand Annotation, Target Object Annotation, Image Caption, and Visual QA Pairs. Whereas conventional Referring Expression Comprehension (REC) tasks support only language-based localization, our proposed task resolves referring ambiguity by exploiting physical hand-pointing cues rather than relying on linguistic descriptions alone for target grounding.
  • Figure 2: Statistical overview and distribution analysis of the EgoPoint-Ground dataset. The dataset encompasses a broad spectrum of everyday objects, accompanied by semantically rich and lexically diverse textual annotations.
  • Figure 3: Analysis of cross-modal quality and semantic diversity. (a) Visual-textual semantic alignment heatmap; the high diagonal similarity confirms strong semantic consistency between images and their descriptions. (b) Category-level semantic distribution scatter plot: image-description embeddings are projected to 2D with PCA and color-coded by category. The points are spread uniformly without significant clustering, demonstrating the diversity of the dataset's semantic space.
  • Figure 4: Overview of the EgoPoint-Ground Dataset Collection and Multi-Stage Annotation Pipeline. Stages 1–3 illustrate raw data acquisition and the fundamental annotation workflow. Stage 4 details the extraction of 2D keypoint pose information (a toy geometric sketch of how such keypoints can be used appears after this figure list). Stage 5 then focuses on generating high-quality Visual Question-Answering (VQA) pairs for the target objects.
  • Figure 5: Image examples from the EgoPoint-Ground dataset. We showcase example images and partial annotations, including bounding boxes for the hand and the target object, as well as the corresponding caption for the target object. The first row illustrates data collected from our real-world captures.
  • ...and 4 more figures
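
To build intuition for how the 2D hand keypoints extracted in Stage 4 (Figure 4) can ground a pointing gesture geometrically, here is a minimal Python sketch. It scores candidate boxes by how well they align with a ray cast from the wrist through the index fingertip. This is a simple heuristic baseline for illustration only, not the paper's SV-CoT method, and all coordinates are made up.

```python
import numpy as np

def pointing_score(wrist, index_tip, bbox):
    """Cosine alignment between the 2D pointing ray (wrist -> index fingertip)
    and the direction from the fingertip to a candidate box center."""
    ray = np.asarray(index_tip, float) - np.asarray(wrist, float)
    ray /= np.linalg.norm(ray) + 1e-8
    x1, y1, x2, y2 = bbox
    center = np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])
    to_box = center - np.asarray(index_tip, float)
    to_box /= np.linalg.norm(to_box) + 1e-8
    return float(ray @ to_box)  # in [-1, 1]; 1 means the box lies on the ray

# Toy usage with made-up coordinates: pick the candidate box most
# consistent with the pointing direction.
wrist, tip = (530.0, 455.0), (498.0, 402.0)
candidates = {"mug": (120, 240, 205, 330), "kettle": (260, 230, 360, 340)}
best = max(candidates, key=lambda name: pointing_score(wrist, tip, candidates[name]))
print(best)  # candidate whose center best aligns with the pointing ray
```

A purely geometric score like this cannot break ties between objects lying along the same ray, which is exactly the ambiguity the paper's benchmark targets by combining gestural and linguistic cues.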