HOGraspFlow: Taxonomy-Aware Hand-Object Retargeting for Multi-Modal SE(3) Grasp Generation

Yitian Shi; Zicheng Guo; Rosa Wolf; Edgar Welte; Rania Rayyes

TL;DR

HOGraspFlow addresses the challenge of retargeting human hand-object interactions to parallel-jaw grasps without object geometry by conditioning SE(3) grasp generation on RGB-based HOI features, hand contact, and grasp taxonomy priors. It introduces two generative frameworks, HOGraspDiff (score matching) and HOGraspFlow (flow matching), and demonstrates that flow matching yields higher distributional fidelity and stability under guidance. The method achieves object-agnostic, multi-modal grasp synthesis from a single RGB frame, with an average real-world success rate above $83\%$ on a UR10e platform, and outperforms diffusion-based baselines and contact-only conditioning. These results underscore the practicality of vision-based, HOI-informed retargeting for in-the-wild manipulation without explicit 3D object models.

Abstract

We propose Hand-Object GraspFlow (\emph{HOGraspFlow}), an affordance-centric approach that retargets a single RGB image of a hand-object interaction (HOI) into multi-modal, executable parallel-jaw grasps without explicit geometric priors on the target object. Building on foundation models for hand reconstruction and vision, we synthesize $SE(3)$ grasp poses with denoising flow matching (FM), conditioned on three complementary cues: RGB foundation features as visual semantics, HOI contact reconstruction, and a taxonomy-aware prior on grasp types. Our approach demonstrates high fidelity in grasp synthesis without explicit HOI contact input or object geometry, while maintaining strong contact and taxonomy recognition. A controlled comparison further shows that \emph{HOGraspFlow} consistently outperforms diffusion-based variants (\emph{HOGraspDiff}), achieving higher distributional fidelity and more stable optimization in $SE(3)$. In real-world experiments, we demonstrate reliable, object-agnostic grasp synthesis from human demonstrations, with an average success rate of over $83\%$. Code: https://github.com/YitianShi/HOGraspFlow
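To make the flow-matching idea concrete, the following is a minimal toy sketch, not the authors' implementation. It covers only the translational part of a grasp pose in $\mathbb{R}^3$, uses a straight-line probability path, and replaces the learned, conditioned velocity network with a closed-form oracle that points at a single known target. The rotational part of $SE(3)$ and all conditioning cues (RGB features, contact, taxonomy) are deliberately left out. All function names (`conditional_path`, `target_velocity`, `oracle_velocity`, `euler_sample`) and the target value are hypothetical and chosen for illustration.

```python
import numpy as np


def conditional_path(x0, x1, t):
    """Point at time t on the straight line from noise sample x0 to target x1."""
    return (1.0 - t) * x0 + t * x1


def target_velocity(x0, x1):
    """Regression target a velocity network would be trained on for the linear path."""
    return x1 - x0


def oracle_velocity(x, t, x1):
    """Closed-form velocity toward one known target.

    Assumption: this stands in for a trained, conditioned network; it is only
    valid for t < 1 and for a single target x1.
    """
    return (x1 - x) / (1.0 - t)


def euler_sample(x0, x1, num_steps=20):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with explicit Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the oracle is well defined
        x = x + dt * oracle_velocity(x, t, x1)
    return x


rng = np.random.default_rng(0)
x0 = rng.standard_normal(3)          # noise sample (toy translation)
x1 = np.array([0.4, -0.1, 0.25])     # hypothetical target grasp translation
x_final = euler_sample(x0, x1)
```

In this toy setting the Euler iterates stay on the straight line between `x0` and `x1`, so `x_final` coincides with `x1` up to floating-point error. A full $SE(3)$ variant would additionally need a geodesic path on the rotation group and a network that predicts the velocity from the conditioning features.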
