MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation
Bohan Zhou, Yi Zhan, Zhongbin Zhang, Zongqing Lu
TL;DR
MEgoHand tackles the challenge of generating realistic egocentric hand-object interactions under unstable viewpoints and novel objects by unifying text, RGB, and initial MANO hand parameters into a bi-level architecture. The Cerebrum module leverages a vision-language model and monocular depth to form high-level priors, while the Cerebellum employs a DiT-based flow-matching policy with Temporal Orthogonal Filtering to produce stable, fine-grained hand trajectories. A unified dataset is curated via an Inverse MANO Retargeting Network and a Virtual RGB-D Renderer to overcome annotation inconsistencies, resulting in 3.35M RGB-D frames, 24K trajectories, and 1.2K objects. Across five in-domain and two cross-domain benchmarks, MEgoHand achieves state-of-the-art performance with substantial reductions in wrist translation and joint rotation errors, and after Procrustes alignment, further improvements in joint and mesh vertex accuracy, demonstrating strong generalization and practical potential for AR/VR and robotic imitation. The work highlights the value of integrating vision-language priors with 3D reasoning and flow-based generation, while pointing to future directions in data curation and real-world depth guidance.
Abstract
Egocentric hand-object motion generation is crucial for immersive AR/VR and robotic imitation but remains challenging due to unstable viewpoints, self-occlusions, perspective distortion, and noisy ego-motion. Existing methods rely on predefined 3D object priors, which limits their generalization to novel objects. Meanwhile, recent multimodal approaches suffer from ambiguous generation from abstract textual cues, intricate pipelines for modeling 3D hand-object correlation, and compounding errors in open-loop prediction. We propose MEgoHand, a multimodal framework that synthesizes physically plausible hand-object interactions from egocentric RGB, text, and initial hand pose. MEgoHand introduces a bi-level architecture: a high-level "cerebrum" leverages a vision-language model (VLM) to infer motion priors from visual-textual context and a monocular depth estimator for object-agnostic spatial reasoning, while a low-level DiT-based flow-matching policy generates fine-grained trajectories with temporal orthogonal filtering to enhance stability. To address dataset inconsistency, we design a dataset curation paradigm with an Inverse MANO Retargeting Network and Virtual RGB-D Renderer, curating a unified dataset of 3.35M RGB-D frames, 24K interactions, and 1.2K objects. Extensive experiments across five in-domain and two cross-domain datasets demonstrate the effectiveness of MEgoHand, achieving substantial reductions in wrist translation error (86.9%) and joint rotation error (34.1%), highlighting its capacity to accurately model fine-grained hand joint structures and generalize robustly across diverse scenarios.
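The abstract names a flow-matching policy but does not describe how samples are drawn from it. As an illustration only, the sketch below shows the generic sampling idea behind flow matching: start from noise and integrate a velocity field from t = 0 to t = 1. It assumes a straight-line probability path and forward Euler integration, neither of which is confirmed by the text above. The names `euler_flow_sample` and `make_toy_velocity` are hypothetical, and the toy velocity field stands in for the learned DiT network. Conditioning on VLM and depth features and the temporal orthogonal filtering step are omitted.

```python
import random


def euler_flow_sample(velocity, x0, num_steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with forward Euler.

    `velocity` is any callable (x, t) -> list of floats. In a real system it
    would be a learned network; here it is an assumption for illustration.
    """
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * x0 + t * x1, the velocity at
    (x_t, t) equals (x1 - x_t) / (1 - t). We hard-code x1 = target.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


random.seed(0)
target = [0.1, -0.2, 0.3]  # e.g. a 3D wrist translation (made-up values)
noise = [random.gauss(0.0, 1.0) for _ in target]  # sample from the noise prior
sample = euler_flow_sample(make_toy_velocity(target), noise, num_steps=10)
```

With this toy field the integration ends at `target` up to floating-point error, which makes the transport from noise to data easy to check. A learned velocity field would instead map noise to a distribution of plausible hand trajectories.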
