MEgoHand: Multimodal Egocentric Hand-Object Interaction Motion Generation

Bohan Zhou, Yi Zhan, Zhongbin Zhang, Zongqing Lu

TL;DR

MEgoHand tackles the challenge of generating realistic egocentric hand-object interactions under unstable viewpoints and novel objects by unifying text, RGB, and initial MANO hand parameters into a bi-level architecture. The Cerebrum module leverages a vision-language model and monocular depth to form high-level priors, while the Cerebellum employs a DiT-based flow-matching policy with Temporal Orthogonal Filtering to produce stable, fine-grained hand trajectories. A unified dataset is curated via an Inverse MANO Retargeting Network and a Virtual RGB-D Renderer to overcome annotation inconsistencies, resulting in 3.35M RGB-D frames, 24K trajectories, and 1.2K objects. Across five in-domain and two cross-domain benchmarks, MEgoHand achieves state-of-the-art performance with substantial reductions in wrist translation and joint rotation errors, and after Procrustes alignment, further improvements in joint and mesh vertex accuracy, demonstrating strong generalization and practical potential for AR/VR and robotic imitation. The work highlights the value of integrating vision-language priors with 3D reasoning and flow-based generation, while pointing to future directions in data curation and real-world depth guidance.
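
The summary does not spell out how Temporal Orthogonal Filtering works. As a rough illustration of the general idea of smoothing rotations over time while keeping them valid, the sketch below averages neighbouring rotation matrices and projects each average back onto SO(3) with an SVD. This is an assumption made for illustration, not the paper's definition; the names `project_to_so3`, `smooth_rotations`, `rot_z`, and the window size are hypothetical.

```python
import numpy as np

def project_to_so3(m):
    """Return the rotation matrix closest to m in Frobenius norm (via SVD)."""
    u, _, vt = np.linalg.svd(m)
    # Flip the last axis if needed so the result has determinant +1.
    d = np.sign(np.linalg.det(u @ vt))
    return u @ np.diag([1.0, 1.0, d]) @ vt

def smooth_rotations(rots, window=3):
    """Moving-average a (T, 3, 3) rotation sequence, re-projecting onto SO(3)."""
    rots = np.asarray(rots, dtype=float)
    half = window // 2
    out = np.empty_like(rots)
    for i in range(len(rots)):
        lo, hi = max(0, i - half), min(len(rots), i + half + 1)
        # The plain average is generally not a rotation, hence the projection.
        out[i] = project_to_so3(rots[lo:hi].mean(axis=0))
    return out

def rot_z(angle):
    """Rotation about the z-axis by `angle` radians (toy data helper)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Toy sequence: a slow rotation about z with per-frame jitter.
rng = np.random.default_rng(0)
angles = np.linspace(0.0, 0.5, 8) + 0.05 * rng.standard_normal(8)
noisy = np.stack([rot_z(a) for a in angles])
smoothed = smooth_rotations(noisy, window=3)
```

Averaging matrices entry-wise is only a reasonable approximation when neighbouring rotations are close to each other, which is the usual case for consecutive frames of a hand trajectory.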

Abstract

Egocentric hand-object motion generation is crucial for immersive AR/VR and robotic imitation but remains challenging due to unstable viewpoints, self-occlusions, perspective distortion, and noisy ego-motion. Existing methods rely on predefined 3D object priors, which limits their generalization to novel objects. Meanwhile, recent multimodal approaches suffer from ambiguous generation from abstract textual cues, intricate pipelines for modeling 3D hand-object correlation, and compounding errors in open-loop prediction. We propose MEgoHand, a multimodal framework that synthesizes physically plausible hand-object interactions from egocentric RGB, text, and initial hand pose. MEgoHand introduces a bi-level architecture: a high-level "cerebrum" leverages a vision-language model (VLM) to infer motion priors from visual-textual context and a monocular depth estimator for object-agnostic spatial reasoning, while a low-level DiT-based flow-matching policy generates fine-grained trajectories with temporal orthogonal filtering to enhance stability. To address dataset inconsistency, we design a dataset curation paradigm with an Inverse MANO Retargeting Network and Virtual RGB-D Renderer, curating a unified dataset of 3.35M RGB-D frames, 24K interactions, and 1.2K objects. Extensive experiments across five in-domain and two cross-domain datasets demonstrate the effectiveness of MEgoHand, achieving substantial reductions in wrist translation error (86.9%) and joint rotation error (34.1%), highlighting its capacity to accurately model fine-grained hand joint structures and generalize robustly across diverse scenarios.
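
To make the flow-matching idea concrete, here is a minimal sketch of how a sample is typically drawn from such a model: start from noise and integrate a velocity field from t = 0 to t = 1 with Euler steps. In MEgoHand the velocity would come from the DiT head conditioned on the multimodal embedding; here a hand-written toy field stands in for the network. The function names, the step count, and the toy target are all assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def euler_flow_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1 inside the loop
        x = x + dt * velocity_fn(x, t)
    return x

# Toy stand-in for a learned network: the exact velocity of the straight-line
# path from the current point to a known target (hypothetical 3-D "motion").
target = np.array([0.1, -0.2, 0.3])

def toy_velocity(x, t):
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal(3)  # sample from the source distribution
sample = euler_flow_sample(toy_velocity, noise, num_steps=10)
```

With this oracle field the integration lands on `target`; a trained network only approximates the field, so the number of steps becomes a trade-off between accuracy and inference cost.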

Paper Structure

This paper contains 28 sections, 8 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: MEgoHand stands as the starting point for generating high-quality motion sequences of hand-object interactions, conditioned on egocentric RGB images, textual instructions, and given initial MANO hand parameters.
  • Figure 1: We forward the MANO model to convert the outputs of Inverse MANO Retargeting Network $\phi$ to hand meshes, which are projected to the original frames in FPHA with the help of camera intrinsics and extrinsics.
  • Figure 2: During inference, the system prompt and task instruction are encoded using a frozen VLM tokenizer. At each timestep, an RGB image is processed by a pretrained depth estimator to obtain a metric depth map. The RGB and depth images are then combined and encoded into a visual embedding, which—together with the text embedding—is input to the frozen VLM. A DiT-based motion generator receives this multimodal representation along with the initial hand parameters to predict relative future hand motion. During training, the depth encoder, VLM vision encoder, and DiT head are finetuned.
  • Figure 2: Illustration for smoothing predicted transformations.
  • Figure 3: The evaluation of our two methods and two baseline variants on five in-domain (H2O, HOI4D, HOT3D, OAKINK2, TACO) and two cross-domain datasets (ARCTIC, HOLO), using MPJPE as metric (unit: cm, lower is better).
  • ...and 6 more figures
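
Figure 3 reports MPJPE, and the TL;DR refers to results after Procrustes alignment. The sketch below shows how these two metrics are commonly computed for a single frame: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and the Procrustes-aligned variant first removes a best-fit similarity transform (scale, rotation, translation). The joint count of 21 and all function names are assumptions for illustration; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error for (J, 3) arrays, in the input units."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def procrustes_align(pred, gt):
    """Similarity-align pred (J, 3) to gt (J, 3): scale, rotation, translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    u, s, vt = np.linalg.svd(p.T @ g)
    # Guard against reflections so the fitted transform is a proper rotation.
    d = np.sign(np.linalg.det(u @ vt))
    flip = np.array([1.0, 1.0, d])
    rot = u @ np.diag(flip) @ vt          # p @ rot best matches g
    scale = (s * flip).sum() / (p ** 2).sum()
    return scale * p @ rot + mu_g

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of the prediction to the ground truth."""
    return mpjpe(procrustes_align(pred, gt), gt)

# Toy check: a prediction that differs from the ground truth only by a
# similarity transform has a large MPJPE but a near-zero aligned error.
rng = np.random.default_rng(0)
gt_joints = rng.standard_normal((21, 3))  # 21 joints is an assumption
c, s_ = np.cos(0.7), np.sin(0.7)
rot_z = np.array([[c, -s_, 0.0], [s_, c, 0.0], [0.0, 0.0, 1.0]])
pred_joints = 2.0 * gt_joints @ rot_z + np.array([0.5, -1.0, 2.0])

raw_error = mpjpe(pred_joints, gt_joints)
aligned_error = pa_mpjpe(pred_joints, gt_joints)
```

The aligned metric isolates errors in hand shape and articulation, whereas the raw metric also penalizes global wrist placement, which is why papers often report both.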