HOGraspFlow: Taxonomy-Aware Hand-Object Retargeting for Multi-Modal SE(3) Grasp Generation

Yitian Shi, Zicheng Guo, Rosa Wolf, Edgar Welte, Rania Rayyes

TL;DR

HOGraspFlow addresses the challenge of retargeting human hand-object interactions to parallel-jaw grasps without object geometry by conditioning SE(3) grasp generation on RGB-based HOI features, hand contact, and grasp taxonomy priors. It introduces two generative frameworks, HOGraspDiff (score matching) and HOGraspFlow (flow matching), and demonstrates that flow matching yields higher distributional fidelity and stability under guidance. The method achieves object-agnostic, multi-modal grasp synthesis from a single RGB frame, with real-world success rates exceeding $83\%$ on a UR10e platform, and outperforms diffusion-based baselines and contact-only conditioning. These results underscore the practicality of vision-based HOI-informed retargeting for in-the-wild manipulation without explicit 3D object models.

Abstract

We propose \emph{Hand-Object (HO)GraspFlow}, an affordance-centric approach that retargets a single RGB image of a hand-object interaction (HOI) into multi-modal, executable parallel-jaw grasps without explicit geometric priors on the target objects. Building on foundation models for hand reconstruction and vision, we synthesize $SE(3)$ grasp poses with denoising flow matching (FM), conditioned on three complementary cues: RGB foundation features as visual semantics, HOI contact reconstruction, and a taxonomy-aware prior on grasp types. Our approach demonstrates high fidelity in grasp synthesis without explicit HOI contact input or object geometry, while maintaining strong contact and taxonomy recognition. A controlled comparison further shows that \emph{HOGraspFlow} consistently outperforms diffusion-based variants (\emph{HOGraspDiff}), achieving high distributional fidelity and more stable optimization in $SE(3)$. In real-world experiments, we demonstrate reliable, object-agnostic grasp synthesis from human demonstrations, with an average success rate of over $83\%$. Code: https://github.com/YitianShi/HOGraspFlow
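To make the flow matching idea in the abstract concrete, the following is a minimal sketch of the generic linear-path formulation on plain Euclidean vectors. It is not the authors' implementation: the function names `fm_training_pair` and `euler_sample` are hypothetical, the example treats only a 3-D translation, and the rotational part of an $SE(3)$ pose would need manifold-aware interpolation and integration that this toy omits. Conditioning on RGB features, contact, or taxonomy is also left out.

```python
import numpy as np

def fm_training_pair(x0, x1, t):
    """Linear-path flow matching (assumed generic form, not the paper's code).

    x0: noise samples, x1: data samples, t: times in [0, 1], one per row.
    Returns the point on the straight path and the regression target velocity.
    """
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    x_t = (1.0 - t) * x0 + t * x1   # interpolate between noise and data
    u_t = x1 - x0                   # constant velocity along the straight path
    return x_t, u_t

def euler_sample(velocity_fn, x0, n_steps=50):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy check: with a single target point, the exact marginal velocity field
# is (target - x) / (1 - t), so sampling should land on the target.
target = np.array([0.3, -0.1, 0.5])  # hypothetical grasp translation
sample = euler_sample(lambda x, t: (target - x) / (1.0 - t), np.zeros(3))
```

In a learned model, `velocity_fn` would be a network trained to regress `u_t` from `x_t`, `t`, and the conditioning cues; the closed-form field above only stands in for it so that the sketch runs on its own.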

Paper Structure

This paper contains 25 sections, 17 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Grasp demonstrations for parallel jaw grippers via HOGraspFlow
  • Figure 2: Pipeline for HOGraspFlow
  • Figure 3: Denoising process and generation results. Vertices in contact are in red. Parameters for guidance: $\theta_{\rm thr} = 0.8$, $\lambda^{gd} = 10^{-3}$.
  • Figure 4: Generated parallel-jaw (PJ) grasps via region-conditioned contact matching with HOGraspNet annotations [cho2024dense], using hand regions defined by [yang2021cpf].
  • Figure 5: Comparison of EMD histograms between HOGraspDiff and HOGraspFlow on HOGraspNet (in frequency).
  • ...and 1 more figure