
Pointing-Based Object Recognition

Lukáš Hajdúch, Viktor Kocur

Abstract

This paper presents a comprehensive pipeline for recognizing objects targeted by human pointing gestures using RGB images. As human-robot interaction moves toward more intuitive interfaces, the ability to identify targets of non-verbal communication becomes crucial. Our proposed system integrates several existing state-of-the-art methods, including object detection, body pose estimation, monocular depth estimation, and vision-language models. We evaluate the impact of 3D spatial information reconstructed from a single image and the utility of image captioning models in correcting classification errors. Experimental results on a custom dataset show that incorporating depth information significantly improves target identification, especially in complex scenes with overlapping objects. The modularity of the approach allows for deployment in environments where specialized depth sensors are unavailable.

Paper Structure (16 sections, 1 equation, 2 figures, 3 tables)


Figures (2)

  • Figure 1: The full pipeline of the proposed system. The input image (a) is fed into three parallel branches: pose estimation (b), object detection (c) and depth estimation (d). The outputs of the three branches get combined to first classify whether pointing occurs and which detected object is pointed at (e). Finally, an image captioning module (f) is used to provide a better description of the object and to potentially fix an incorrect object label assignment.
  • Figure 2: Sample images from our dataset. We use different types of gestures to indicate pointing at objects and neutral poses (no pointing). The dataset captures different levels of complexity in terms of item positions.
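The fusion step (e) in Figure 1 can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the pointing direction is approximated by the elbow-to-wrist ray, that keypoints and object centers have already been lifted to 3D using the estimated depth map, and that an object counts as pointed at when its center lies within a hypothetical `max_distance` of that ray. The function names and the threshold value are illustrative only.

```python
import numpy as np


def point_to_ray_distance(point, origin, direction):
    """Distance from a 3D point to a ray that starts at `origin`."""
    d = direction / np.linalg.norm(direction)
    v = point - origin
    # Clamp the projection so points behind the ray origin
    # are measured to the origin itself, not to the full line.
    t = max(float(np.dot(v, d)), 0.0)
    closest = origin + t * d
    return float(np.linalg.norm(point - closest))


def select_pointed_object(elbow, wrist, object_centers, max_distance=0.5):
    """Return the label of the object nearest to the pointing ray, or None.

    elbow, wrist   : 3D keypoints (assumed: 2D pose lifted with estimated depth)
    object_centers : dict mapping label -> 3D center of a detected box
    max_distance   : assumed threshold; None is returned if no object is closer
    """
    elbow = np.asarray(elbow, dtype=float)
    wrist = np.asarray(wrist, dtype=float)
    direction = wrist - elbow
    if np.linalg.norm(direction) < 1e-9:
        return None  # degenerate arm, no usable direction

    best_label, best_dist = None, max_distance
    for label, center in object_centers.items():
        dist = point_to_ray_distance(np.asarray(center, dtype=float), wrist, direction)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


if __name__ == "__main__":
    objects = {"cup": (3.0, 0.1, 0.0), "book": (2.0, 1.0, 0.0)}
    print(select_pointed_object((0, 0, 0), (1, 0, 0), objects))
```

Returning `None` when no object falls within the threshold gives a simple way to represent the "no pointing" case, although a real system would likely also use the pose itself to decide whether a pointing gesture is present.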