Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos

Mingfei Chen, Yifan Wang, Zhengqin Li, Homanga Bharadhwaj, Yujin Chen, Chuan Qin, Ziyi Kou, Yuan Tian, Eric Whitmire, Rajinder Sodhi, Hrvoje Benko, Eli Shlizerman, Yue Liu

TL;DR

This work addresses the challenge of long-horizon, intent-conditioned 3D hand trajectory prediction from egocentric video by introducing the EgoMAN dataset and a modular EgoMAN model. The framework couples a Reasoning Module with a Motion Expert through a trajectory-token interface and employs progressive training to align semantic intent with physically grounded motion via Flow Matching. EgoMAN achieves state-of-the-art accuracy and generalization across in-domain and out-of-distribution scenes, while maintaining high efficiency, highlighting the value of explicit interaction-stage reasoning for real-world manipulation tasks. The dataset's rich QA supervision and the token-based interface enable robust, controllable motion generation, with broad implications for robot learning from human demonstrations and language-conditioned motion synthesis.
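As a rough illustration of the Flow Matching objective mentioned above, the sketch below uses the generic straight-line (linear interpolation) formulation: a model regresses the velocity that carries a noise sample to a target trajectory, and sampling integrates that velocity with Euler steps. The page does not state the exact path, parameterization, or conditioning that EgoMAN uses, so the function names, the flattened six-number "trajectory", and the closed-form `oracle_velocity` that stands in for a trained network are all assumptions made for illustration.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 to data x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the linear path: d/dt x_t = x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    pred = velocity_fn(x_t, t)
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical target: two future 3D wrist positions, flattened to six numbers.
future = [0.10, 0.00, 0.40, 0.15, 0.02, 0.42]


def oracle_velocity(x, t):
    """Exact velocity field when there is a single target (stand-in for a trained model)."""
    return [(b - a) / (1.0 - t) for a, b in zip(x, future)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in future]

loss = flow_matching_loss(oracle_velocity, noise, future, t=0.3)
sample = euler_sample(oracle_velocity, noise, steps=10)
```

With the oracle field the loss is zero up to rounding and the Euler rollout lands on `future`. A learned model would replace `oracle_velocity` and would additionally take conditioning inputs such as waypoints, past motion, and intent features.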

Abstract

Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.

Paper Structure

This paper contains 32 sections, 6 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: EgoMAN project. We introduce 1) the EgoMAN dataset (top), a large-scale egocentric dataset for interaction stage–aware 3D hand trajectory prediction with 219K 6-DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. During inference, 2) the EgoMAN model (bottom) takes an image, past hand motion, and an intent query as input, performs stage-aware reasoning to infer intent-specific waypoints, and then generates a distinct 6-DoF hand trajectory for each intent query.
  • Figure 2: Overview of the EgoMAN model. The EgoMAN model is a modular reasoning-to-motion framework that predicts future 6DoF hand trajectories from an egocentric RGB frame, past wrist trajectories, and a language intent. The Reasoning Module (a), built on QwenVL-7B, extracts semantic and spatial features and outputs trajectory tokens that carry waypoints and intent semantic cues. The Motion Expert (b), using Flow Matching, predicts future trajectories from waypoints, past motion, intent semantics, and visual input. The trajectory tokens from (a) form the Trajectory-Token Interface, which replaces the semantic and waypoint conditioning inputs of (b) and thereby bridges the Reasoning Module and the Motion Expert.
  • Figure 2: Waypoint prediction results. Lower is better for Contact and Traj; higher is better for FPS (averaged over 50 samples on an NVIDIA PG509-210, 80GB). EgoMAN-WP achieves the best accuracy, improving Contact by 33.8% and Traj by 52.8% on EgoMAN-Unseen, and runs orders of magnitude faster at 3.45 FPS.
  • Figure 3: Qualitative comparisons on EgoMAN-Bench. We visualize best-of-$K{=}10$ predictions for waypoints and full trajectories. Left: <CONTACT> and <END> waypoint predictions compared with VRB* and VidBot. Right: 3D hand trajectory forecasts and 2D projections compared with prior baselines. Our EgoMAN model produces the smoothest trajectories and the closest match to ground truth.
  • Figure 4: Qualitative results on diverse activities. EgoMAN generates accurate 6DoF hand trajectories for diverse activities, aligning motion with the intent description and the spatial layout of the scene.
  • ...and 6 more figures
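
The best-of-$K$ selection mentioned in the Figure 3 caption can be sketched as follows: out of K sampled trajectories, report the one with the lowest error against ground truth. This is a minimal sketch that assumes the error is the average point-wise Euclidean distance; the metrics actually used in the paper are not given on this page, and the helper names and toy data are hypothetical.

```python
import math


def average_displacement(pred, gt):
    """Mean Euclidean distance between corresponding 3D points of two trajectories."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def best_of_k(candidates, gt):
    """Return (index, error) of the candidate trajectory closest to ground truth."""
    errors = [average_displacement(c, gt) for c in candidates]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors[best]


# Toy ground truth: three wrist positions along the x axis (units arbitrary).
gt = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0)]

# Three hypothetical samples, each shifted sideways by a constant offset in y.
offsets = [0.03, 0.01, 0.02]
candidates = [[(x, y + off, z) for (x, y, z) in gt] for off in offsets]

best_index, best_error = best_of_k(candidates, gt)
```

Here the second candidate (index 1) has the smallest sideways offset and is selected. Because the minimum is taken with knowledge of the ground truth, best-of-K measures whether the sample set covers the true motion, not the quality of a single prediction.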