Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos
Mingfei Chen, Yifan Wang, Zhengqin Li, Homanga Bharadhwaj, Yujin Chen, Chuan Qin, Ziyi Kou, Yuan Tian, Eric Whitmire, Rajinder Sodhi, Hrvoje Benko, Eli Shlizerman, Yue Liu
TL;DR
This work tackles long-horizon, intent-conditioned 3D hand trajectory prediction from egocentric video by introducing the EgoMAN dataset and the modular EgoMAN model. The model couples a Reasoning Module with a Motion Expert through a trajectory-token interface, and uses progressive training to align semantic intent with physically grounded motion generated via Flow Matching. EgoMAN achieves state-of-the-art accuracy and generalizes across in-domain and out-of-distribution scenes while remaining efficient, which highlights the value of explicit interaction-stage reasoning for real-world manipulation tasks. The dataset's structured QA supervision and the token-based interface support robust, controllable motion generation, with implications for robot learning from human demonstrations and for language-conditioned motion synthesis.
Abstract
Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that only weakly link reasoning and action. To address these limitations, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate, stage-aware trajectories and generalizes across real-world scenes.
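To make the Flow Matching component mentioned in the TL;DR more concrete, the following is a minimal sketch of the commonly used linear-path flow matching objective and an Euler sampler, applied to a toy trajectory tensor. It is not the EgoMAN implementation: the function names, the `cond` argument (a stand-in for whatever the trajectory-token interface would pass to the Motion Expert), the 8-step horizon, and the oracle velocity field are all assumptions made for illustration.

```python
import numpy as np

def fm_training_pair(x0, x1, t):
    """Return the point at time t on the straight path from noise x0 to
    data x1, together with the constant target velocity of that path."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

def fm_loss(velocity_fn, x0, x1, t, cond):
    """Mean squared error between the predicted and the target velocity."""
    x_t, v_target = fm_training_pair(x0, x1, t)
    v_pred = velocity_fn(x_t, t, cond)
    return float(np.mean((v_pred - v_target) ** 2))

def euler_sample(velocity_fn, x0, cond, steps=10):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt, cond)
    return x

rng = np.random.default_rng(0)
horizon, dof = 8, 6                      # hypothetical: 8 future steps, 6DoF each
x1 = rng.normal(size=(horizon, dof))     # stand-in for a ground-truth trajectory
x0 = rng.normal(size=(horizon, dof))     # Gaussian noise sample
cond = None                              # placeholder for intent/trajectory tokens

# Oracle velocity field for this single (x0, x1) pair; a trained network
# conditioned on `cond` would take its place in practice.
oracle = lambda x, t, c: x1 - x0

print(fm_loss(oracle, x0, x1, 0.3, cond))
print(np.allclose(euler_sample(oracle, x0, cond), x1))
```

With the oracle field, the loss vanishes and Euler integration carries the noise sample onto the target trajectory; a learned, conditioned velocity network would only approximate this behavior.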
