Pointing-Based Object Recognition

Lukáš Hajdúch; Viktor Kocur

Abstract

This paper presents a comprehensive pipeline for recognizing objects targeted by human pointing gestures using RGB images. As human-robot interaction moves toward more intuitive interfaces, the ability to identify targets of non-verbal communication becomes crucial. Our proposed system integrates several existing state-of-the-art methods, including object detection, body pose estimation, monocular depth estimation, and vision-language models. We evaluate the impact of 3D spatial information reconstructed from a single image and the utility of image captioning models in correcting classification errors. Experimental results on a custom dataset show that incorporating depth information significantly improves target identification, especially in complex scenes with overlapping objects. The modularity of the approach allows for deployment in environments where specialized depth sensors are unavailable.

Paper Structure (16 sections, 1 equation, 2 figures, 3 tables)

Introduction
Related Work
Proposed Methodology
Object Detection Branch
Human Pose Recognition
Depth Estimation
Object Recognition
Gesture Recognition
Image Captioning
Dataset and Experimental Setup
Results and Discussion
Gesture Recognition
Object Recognition
Performance of Image Captioning
Limitations
...and 1 more sections

Figures (2)

Figure 1: The full pipeline of the proposed system. The input image (a) is fed into three parallel branches: pose estimation (b), object detection (c) and depth estimation (d). The outputs of the three branches are combined to classify whether pointing occurs and, if so, which detected object is pointed at (e). Finally, an image captioning module (f) provides a richer description of the object and can correct an incorrect object label assignment.
Figure 2: Sample images from our dataset. We use different types of gestures to indicate pointing at objects and neutral poses (no pointing). The dataset captures different levels of complexity in terms of item positions.
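The fusion step (e) in Figure 1 can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a pinhole camera with known intrinsics, a pointing ray defined by the elbow and wrist keypoints, and a simple angular threshold for deciding whether any object is targeted. The function names, the choice of keypoints, and the 15-degree threshold are all hypothetical and chosen only for illustration.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with an estimated depth to a 3D camera-frame point.

    Assumes a pinhole camera; fx, fy, cx, cy are hypothetical intrinsics.
    """
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def angle_to_ray(origin, direction, point):
    """Angle in degrees between the ray direction and the vector origin -> point."""
    to_point = tuple(p - o for p, o in zip(point, origin))
    norm_d = math.sqrt(sum(d * d for d in direction))
    norm_p = math.sqrt(sum(t * t for t in to_point))
    if norm_d == 0.0 or norm_p == 0.0:
        # Degenerate input: treat as pointing away from the object.
        return 180.0
    cos_a = sum(d * t for d, t in zip(direction, to_point)) / (norm_d * norm_p)
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))


def select_pointed_object(elbow, wrist, object_centers, max_angle_deg=15.0):
    """Return the index of the object closest in angle to the forearm ray.

    The ray starts at the wrist and continues in the elbow -> wrist direction.
    Returns None when no object lies within max_angle_deg (an assumed
    threshold), which this sketch treats as "no pointing at a detected object".
    """
    direction = tuple(w - e for w, e in zip(wrist, elbow))
    best_index, best_angle = None, max_angle_deg
    for i, center in enumerate(object_centers):
        angle = angle_to_ray(wrist, direction, center)
        if angle <= best_angle:
            best_index, best_angle = i, angle
    return best_index


# Usage: the arm points along +x; the first object lies almost on the ray.
elbow = (0.0, 0.0, 1.0)
wrist = (0.2, 0.0, 1.0)
objects = [(1.0, 0.05, 1.0), (0.2, 1.0, 1.0)]
pointed = select_pointed_object(elbow, wrist, objects)
```

Working in 3D rather than in the image plane is what lets such a rule separate objects that overlap in the 2D image but sit at different depths, which is the case the abstract highlights.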
