FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation

Huajian Zeng, Lingyun Chen, Jiaqi Yang, Yuantai Zhang, Fan Shi, Peidong Liu, Xingxing Zuo

TL;DR

FlowHOI is a semantics-grounded, embodiment-agnostic HOI generation framework that uses two-stage flow matching to produce temporally coherent hand-object interactions conditioned on an egocentric view, a language instruction, and 3D scene context. It decouples geometry-centric grasping from semantics-centric manipulation and grounds the latter with compact 3D scene tokens and a motion-text alignment loss, which yields strong semantic alignment, physical plausibility, and fast inference. A reconstruction pipeline recovers HOI data from egocentric videos, providing an HOI prior that supports generalization across objects and tasks. FlowHOI reports state-of-the-art results on GRAB and HOT3D and is retargeted to a real robot on four dexterous tasks. The generated HOI sequence serves as a transferable script that can be integrated into downstream planning and control pipelines; limitations in occlusion handling and dynamics modeling are acknowledged and left for future work.

Abstract

Recent vision-language-action (VLA) models can generate plausible end-effector motions, yet they often fail in long-horizon, contact-rich tasks because the underlying hand-object interaction (HOI) structure is not explicitly represented. An embodiment-agnostic interaction representation that captures this structure would make manipulation behaviors easier to validate and transfer across robots. We propose FlowHOI, a two-stage flow-matching framework that generates semantically grounded, temporally coherent HOI sequences, comprising hand poses, object poses, and hand-object contact states, conditioned on an egocentric observation, a language instruction, and a 3D Gaussian splatting (3DGS) scene reconstruction. We decouple geometry-centric grasping from semantics-centric manipulation, conditioning the latter on compact 3D scene tokens and employing a motion-text alignment loss to semantically ground the generated interactions in both the physical scene layout and the language instruction. To address the scarcity of high-fidelity HOI supervision, we introduce a reconstruction pipeline that recovers aligned hand-object trajectories and meshes from large-scale egocentric videos, yielding an HOI prior for robust generation. Across the GRAB and HOT3D benchmarks, FlowHOI achieves the highest action recognition accuracy and a 1.7$\times$ higher physics simulation success rate than the strongest diffusion-based baseline, while delivering a 40$\times$ inference speedup. We further demonstrate real-robot execution on four dexterous manipulation tasks, illustrating the feasibility of retargeting generated HOI representations to real-robot execution pipelines.
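The two-stage sampling idea can be illustrated with a minimal toy sketch. This is not the authors' implementation: the learned, conditioned velocity network is replaced by a closed-form stand-in (`toy_velocity`), the straight-line probability path is an assumption (a common flow-matching choice that the abstract does not specify), and the names `euler_flow_sample`, `T_GRASP`, `T_MANIP`, and `D` are hypothetical. The sketch only shows how a flow-matching sampler integrates a velocity field from noise to a sample, and how a second stage can be conditioned on the final frame of the first.

```python
import numpy as np


def euler_flow_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (sample) with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t is never zero
        x = x + dt * velocity_fn(x, t)
    return x


def toy_velocity(target):
    """Closed-form stand-in for a learned velocity network (assumption).

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity
    that carries x_t to x_1 is (x_1 - x_t) / (1 - t).
    """
    def v(x, t):
        return (target - x) / (1.0 - t)
    return v


# Hypothetical sequence lengths and per-frame feature dimension.
T_GRASP, T_MANIP, D = 8, 12, 4
rng = np.random.default_rng(0)

# Stage 1 (grasping): map Gaussian noise to a toy "approach and grasp" trajectory.
grasp_target = np.linspace(0.0, 1.0, T_GRASP)[:, None] * np.ones((1, D))
grasp = euler_flow_sample(toy_velocity(grasp_target),
                          rng.standard_normal((T_GRASP, D)))

# Stage 2 (manipulation): the toy target starts at the last grasp frame, which
# stands in for conditioning on the grasp result, scene tokens, and text.
manip_target = grasp[-1] + np.linspace(0.0, 0.5, T_MANIP)[:, None] * np.ones((1, D))
manip = euler_flow_sample(toy_velocity(manip_target),
                          rng.standard_normal((T_MANIP, D)))

# Full toy sequence: grasp frames followed by manipulation frames.
sequence = np.concatenate([grasp, manip], axis=0)
```

In a real system each row of `sequence` would hold hand pose, object pose, and contact-state features, and the velocity function would be a trained network; deterministic integration over a small number of steps is one general reason flow-based samplers can be fast at inference time.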

Paper Structure

This paper contains 34 sections, 42 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: We present a method for generating hand-object interaction (HOI) motions conditioned on egocentric observation, text command, and 3D scene context. We first learn a grasping prior with HOI data extracted from large-scale egocentric videos, and then generate semantically grounded manipulation motions that respect language instructions as well as the surrounding 3D scene context and geometric constraints. The generated motions can be retargeted to robot hands for real-world execution.
  • Figure 2: Overview of our framework. Given an egocentric observation, text command, and 3D scene context, our method generates hand-object interaction motions through a two-stage pipeline: (1) a grasping stage that generates hand motion to approach and grasp the object, fine-tuned on high-fidelity hand-object interaction data reconstructed from large-scale egocentric videos, and (2) a manipulation stage that generates the subsequent interaction conditioned on scene and language.
  • Figure 3: Hand-object data reconstruction pipeline. Given an egocentric RGB video, we detect the grasp-to-manipulation transition frame from wrist motion cues, reconstruct the 3D object mesh from pre-transition frames via segmentation and metric depth estimation, and align the MANO hand mesh with the object under contact and non-penetration constraints to produce an aligned HOI sequence. See supplementary material for the detailed pipeline.
  • Figure 4: Qualitative comparison of HOI generation. We compare our method with DiffH2O (Christen et al., 2024) and LatentHOI (Li et al., 2025) against ground truth (GT). Top row: results on the GRAB dataset. Bottom row: results on the HOT3D dataset in a 3D scene context. Our method generates more natural grasping poses and physically plausible manipulations that better align with the input action instructions and comply with the surrounding 3D scene layout. Best seen in the supplementary video.
  • Figure 5: Showcase of real-world robot applications. We retarget our generated HOI sequence to a Franka Panda arm with Allegro Hand for four contact-rich manipulation tasks: pouring, drinking, tilting, and squeezing. The robot successfully executes contact-rich interactions guided by our HOI sequence.
  • ...and 7 more figures