FlowHOI: Flow-based Semantics-Grounded Generation of Hand-Object Interactions for Dexterous Robot Manipulation
Huajian Zeng, Lingyun Chen, Jiaqi Yang, Yuantai Zhang, Fan Shi, Peidong Liu, Xingxing Zuo
TL;DR
FlowHOI is a semantics-grounded, embodiment-agnostic HOI generation framework that uses two-stage flow matching to produce temporally coherent hand-object interactions conditioned on an egocentric view, a language instruction, and 3D scene context. It decouples geometry-centric grasping from semantics-centric manipulation and grounds the latter with compact 3D scene tokens and a motion-text alignment loss, which yields strong semantic alignment, physical plausibility, and fast inference. A reconstruction pipeline that recovers hand-object data from egocentric videos provides an HOI prior that supports generalization across objects and tasks. FlowHOI reports state-of-the-art results on GRAB and HOT3D and successful real-robot retargeting on four dexterous tasks. The generated HOI sequences serve as a transferable interaction script that can be integrated into downstream planning and control pipelines; occlusion handling and dynamics modeling remain limitations left for future work.
Abstract
Recent vision-language-action (VLA) models can generate plausible end-effector motions, yet they often fail in long-horizon, contact-rich tasks because the underlying hand-object interaction (HOI) structure is not explicitly represented. An embodiment-agnostic interaction representation that captures this structure would make manipulation behaviors easier to validate and transfer across robots. We propose FlowHOI, a two-stage flow-matching framework that generates semantically grounded, temporally coherent HOI sequences, comprising hand poses, object poses, and hand-object contact states, conditioned on an egocentric observation, a language instruction, and a 3D Gaussian splatting (3DGS) scene reconstruction. We decouple geometry-centric grasping from semantics-centric manipulation, conditioning the latter on compact 3D scene tokens and employing a motion-text alignment loss to semantically ground the generated interactions in both the physical scene layout and the language instruction. To address the scarcity of high-fidelity HOI supervision, we introduce a reconstruction pipeline that recovers aligned hand-object trajectories and meshes from large-scale egocentric videos, yielding an HOI prior for robust generation. Across the GRAB and HOT3D benchmarks, FlowHOI achieves the highest action recognition accuracy and a 1.7$\times$ higher physics simulation success rate than the strongest diffusion-based baseline, while delivering a 40$\times$ inference speedup. We further demonstrate real-robot execution on four dexterous manipulation tasks, illustrating the feasibility of retargeting generated HOI representations to real-robot execution pipelines.
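For readers unfamiliar with flow matching, the following is a minimal, generic sketch of the idea and not the FlowHOI implementation. It assumes the common straight-line probability path between a noise sample and a data sample, and it replaces the trained, conditioned velocity network with a hypothetical closed-form `oracle_velocity` so the example runs on its own. All function names and the toy three-dimensional "pose" vector are illustrative assumptions.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 to data x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for a velocity network along the straight-line path."""
    return [b - a for a, b in zip(x0, x1)]


def oracle_velocity(x, t, x1):
    """Hypothetical stand-in for a trained network: exact velocity toward x1.

    A real model would predict this from (x, t) and conditioning inputs
    such as image, text, and scene features; here the target is known.
    """
    return [(b - a) / (1.0 - t) for a, b in zip(x, x1)]


def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # largest t is (num_steps - 1) / num_steps, so 1 - t > 0
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
x1 = [0.3, -1.2, 0.8]                    # toy "data" sample (assumed pose vector)
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise starting point
sample = euler_sample(x0, lambda x, t: oracle_velocity(x, t, x1), num_steps=4)
```

With the oracle field, the final Euler step lands on `x1` up to floating-point error regardless of the step count. A learned velocity field is only approximate, so the number of integration steps becomes a trade-off between sample quality and inference cost.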
