Target-Oriented Object Grasping via Multimodal Human Guidance

Pengwei Xie; Siang Chen; Dingchang Hu; Yixiang Dai; Kaiqin Yang; Guijin Wang

TL;DR

This work tackles target-oriented grasping in human-robot interaction with a two-module pipeline consisting of a Multimodal Guidance Module (MGM) and a Target-Oriented Grasp Network (TOGNet). The MGM locates the target from language instructions, pointing gestures, or interactive clicks, and TOGNet predicts 6-DoF grasps from region-centered patches around that target, yielding efficient, localized grasp predictions that simplify downstream motion planning. TOGNet is trained on a region-focal dataset derived from GraspNet-1Billion and evaluated with a new Target-oriented Average Precision metric. The system reports a success-rate improvement of about 13.7% over baselines in simulation and is also demonstrated in real-world experiments across multiple guidance modes. The results indicate improved grasp quality, reduced computation, and robust operation in cluttered, target-focused tasks, which is relevant to assistive human-robot interaction and collaboration (HRI/C) scenarios. Limitations include segmentation reliability and the handling of complex instruction sets; future work points toward integrating more advanced vision-language models and deploying on constrained devices.

Abstract

In the context of human-robot interaction and collaboration scenarios, robotic grasping still encounters numerous challenges. Traditional grasp detection methods generally analyze the entire scene to predict grasps, leading to redundancy and inefficiency. In this work, we reconsider 6-DoF grasp detection from a target-referenced perspective and propose a Target-Oriented Grasp Network (TOGNet). TOGNet specifically targets local, object-agnostic region patches to predict grasps more efficiently. It integrates seamlessly with multimodal human guidance, including language instructions, pointing gestures, and interactive clicks. Thus, our system comprises two primary functional modules: a guidance module that identifies the target object in 3D space, and TOGNet, which detects region-focal 6-DoF grasps around the target, facilitating subsequent motion planning. Through 50 target-grasping simulation experiments in cluttered scenes, our system achieves a success rate improvement of about 13.7%. In real-world experiments, we demonstrate that our method excels in various target-oriented grasping scenarios.

Paper Structure (22 sections, 5 equations, 8 figures, 4 tables)

Introduction
Related Work
Method
Overview
Multimodal Guidance Module
Language Instructions
Pointing Gestures
Interactive Clicks
Target-Oriented Grasp Network
Region-focal Grasp Detection
Multimodal De-differentiation
Grasp Predictor
Region-focal Dataset Generation
Losses
Experiment
...and 7 more sections

Figures (8)

Figure 1: Our Target-Oriented Grasp Network (TOGNet) is designed to integrate seamlessly with various forms of multimodal human guidance. By cropping and analyzing the RGB-D information within the target area, TOGNet detects high-quality 6-DoF grasps. Then the robot selects the most suitable grasp for execution. The integrated system enables applications in diverse HRI/C scenarios, providing useful assistance to people with visual, auditory, or motor impairments.
Figure 2: System Overview. Taking a monocular RGB-D image as input, the Multimodal Guidance Module (MGM) processes various types of guidance (e.g., language instructions, pointing gestures, and clicks) to locate target regions and sample points as region centers. The neighboring points around the centers are then clustered into multiple local patches. TOGNet then extracts geometric features, predicts grasps locally and transforms them back to the original scene.
Figure 3: Problem formulation of region-focal grasp detection. A: The patch is cropped from the input RGB-D image. B: The local patches are transformed and normalized. C: The regional grasp representation as $(\Delta \mathbf{t}, \theta, \beta, \gamma, w)$.
Figure 4: Detailed structure of proposed TOGNet.
Figure 5: The pipeline of region-focal dataset generation. Grasp centers are sampled using the Gaussian-based strategy. Then, local neighboring points around each center are cropped as patches. Only the grasp labels within a radius of $r$ from the patch center are preserved.
...and 3 more figures
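
The patch cropping, normalization, and label filtering described in the captions of Figures 3 and 5 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`crop_patch`, `filter_grasp_labels`, `to_scene`) are hypothetical, and the normalization shown here (translate to the region center, then divide by the radius $r$) is an assumption, since the captions only state that patches are "transformed and normalized". The Gaussian-based center sampling is not reproduced because the captions do not specify it.

```python
import numpy as np


def crop_patch(points, center, radius):
    """Crop the points within `radius` of `center` and normalize them.

    Assumed normalization: translate so the region center is the origin,
    then scale by 1/radius so the patch fits inside a unit ball.
    Returns the normalized patch and the boolean mask of selected points.
    """
    offsets = points - center
    mask = np.linalg.norm(offsets, axis=1) <= radius
    return offsets[mask] / radius, mask


def filter_grasp_labels(grasp_centers, center, radius):
    """Keep only grasp labels whose centers lie within `radius` of the patch center."""
    return np.linalg.norm(grasp_centers - center, axis=1) <= radius


def to_scene(delta_t, center, radius):
    """Map a predicted translation offset from patch coordinates back to the scene.

    Inverse of the assumed normalization in `crop_patch`.
    """
    return np.asarray(delta_t) * radius + center


if __name__ == "__main__":
    # Toy scene: random points in a 1 m cube (illustrative data only).
    rng = np.random.default_rng(0)
    scene = rng.uniform(0.0, 1.0, size=(2000, 3))
    region_center = np.array([0.5, 0.5, 0.5])  # e.g. a point chosen by the guidance module
    r = 0.1  # hypothetical patch radius in metres

    patch, mask = crop_patch(scene, region_center, r)
    print("points in patch:", patch.shape[0])

    # Hypothetical grasp-center labels; only those near the region center are kept.
    labels = rng.uniform(0.0, 1.0, size=(500, 3))
    keep = filter_grasp_labels(labels, region_center, r)
    print("labels kept:", int(keep.sum()))

    # A predicted offset in patch coordinates is mapped back to scene coordinates.
    print("grasp center in scene:", to_scene([0.2, 0.0, -0.1], region_center, r))
```

Working in a normalized local frame means the network only sees geometry relative to the region center, which is consistent with the object-agnostic, region-focal formulation described in the abstract; the inverse mapping corresponds to the step in Figure 2 where local predictions are transformed back to the original scene.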
