TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos
Yufu Wang, Ziyun Wang, Lingjie Liu, Kostas Daniilidis
TL;DR
TRAM recovers global, world-space 3D human motion from in-the-wild videos captured with moving cameras. It decouples the camera trajectory in the world frame from the body motion in the camera frame, then composes the two in $SE(3)$ to obtain world-space motion. It robustifies monocular SLAM against dynamic humans using dual masking and grounds the metric scale in background-depth cues from $d_i$ and $D_i$: a robust scale term $\alpha$ is estimated so that $\alpha d_i$ aligns with $D_i$ (scale estimation). A novel video transformer, VIMO, extends the large pre-trained HMR2.0 model with two temporal transformers to enforce temporal coherence across image-domain patches and SMPL pose tokens, and is trained with the losses $L_{2D}$, $L_{3D}$, $L_{SMPL}$, and $L_V$. Empirically, TRAM achieves large reductions in global trajectory error and state-of-the-art body-motion reconstruction on benchmarks such as 3DPW, EMDB, and BEDLAM, demonstrating practical potential for real-world, long-range human motion understanding in world coordinates.
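To make the scale-estimation step concrete, here is a minimal sketch of fitting $\alpha$ so that $\alpha d_i$ aligns with $D_i$ on background pixels. It assumes $d_i$ is an up-to-scale depth map from SLAM and $D_i$ a metric depth estimate for the same frame, which the summary above does not spell out. The function name `estimate_scale`, the mask argument, and the median-of-ratios estimator are illustrative choices, not TRAM's actual robust objective.

```python
import numpy as np

def estimate_scale(slam_depth, metric_depth, background_mask):
    """Estimate alpha such that alpha * slam_depth ~ metric_depth.

    Hypothetical helper: uses the median of per-pixel ratios as a simple
    robust estimator. This is a stand-in, not the paper's formulation.
    """
    d = np.asarray(slam_depth, dtype=float)
    D = np.asarray(metric_depth, dtype=float)
    mask = np.asarray(background_mask, dtype=bool)

    # Keep only background pixels with finite, strictly positive depths.
    valid = mask & np.isfinite(d) & np.isfinite(D) & (d > 0) & (D > 0)
    if not valid.any():
        raise ValueError("no valid background pixels to estimate scale from")

    # The median ratio tolerates a minority of outlier pixels.
    return float(np.median(D[valid] / d[valid]))
```

Masking out the human region before taking the ratio keeps pixels on the moving person from biasing the estimate, which is the role the background restriction plays in the description above.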
Abstract
We propose TRAM, a two-stage method to reconstruct a human's global trajectory and motion from in-the-wild videos. TRAM robustifies SLAM to recover the camera motion in the presence of dynamic humans and uses the scene background to derive the motion scale. Using the recovered camera as a metric-scale reference frame, we introduce a video transformer model (VIMO) to regress the kinematic body motion of a human. By composing the two motions, we achieve accurate recovery of 3D humans in the world space, reducing global motion errors by a large margin from prior work. https://yufu-wang.github.io/tram4d/
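The composition of the two motions can be sketched as a per-frame product of rigid transforms. The snippet below is an illustration under stated assumptions: camera poses are given as camera-to-world rotations and up-to-scale translations, the body root pose is expressed in the camera frame, and `alpha` is the recovered metric scale. The function names are hypothetical and do not come from the TRAM code base.

```python
import numpy as np

def to_homogeneous(R, t):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = np.asarray(R, dtype=float)
    T[:3, 3] = np.asarray(t, dtype=float)
    return T

def compose_world_motion(cam_R, cam_t, body_R, body_t, alpha=1.0):
    """Return per-frame world-from-body transforms, shape (N, 4, 4).

    Assumed convention: T_world_cam maps camera coordinates to world
    coordinates, with its translation rescaled by alpha to metric units.
    """
    transforms = []
    for Rc, tc, Rb, tb in zip(cam_R, cam_t, body_R, body_t):
        T_world_cam = to_homogeneous(Rc, alpha * np.asarray(tc, dtype=float))
        T_cam_body = to_homogeneous(Rb, tb)
        transforms.append(T_world_cam @ T_cam_body)
    return np.stack(transforms)
```

For example, a camera rotated 90 degrees about the z-axis and placed at $(1, 0, 0)$ with $\alpha = 2$ moves a body root at $(1, 0, 0)$ in the camera frame to $(2, 1, 0)$ in the world frame.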
