TRAM: Global Trajectory and Motion of 3D Humans from in-the-wild Videos

Yufu Wang; Ziyun Wang; Lingjie Liu; Kostas Daniilidis

TL;DR

TRAM recovers global, world-space 3D human motion from in-the-wild videos captured with moving cameras. It decouples the camera trajectory in the world frame from the body motion in the camera frame, then composes the two in $SE(3)$ to obtain world-space motion. It robustifies monocular SLAM against dynamic humans using dual masking and grounds metric scale in the scene background: a robust scale term $\alpha$ is estimated so that the scaled SLAM depth $\alpha d_i$ aligns with the metric depth prediction $D_i$ (scale estimation). A video transformer, VIMO, extends the large pre-trained HMR2.0 model with two temporal transformers that enforce temporal coherence across image-domain patch tokens and SMPL pose tokens; it is trained with losses $L_{2D}$, $L_{3D}$, $L_{SMPL}$, and $L_V$. Empirically, TRAM reduces global trajectory errors by a large margin and achieves state-of-the-art body-motion reconstruction on benchmarks such as 3DPW, EMDB, and BEDLAM, demonstrating practical potential for long-range human motion understanding in world coordinates.
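To make the scale-estimation idea concrete, here is a minimal sketch of fitting one scalar $\alpha$ so that $\alpha d_i$ matches $D_i$ on background pixels. It is not the paper's optimization procedure: as a simple robust stand-in, it takes the median of the per-pixel ratios $D_i / d_i$. The function name `estimate_scale`, the flat-list inputs, and the `min_depth` threshold are assumptions made for illustration.

```python
from statistics import median


def estimate_scale(slam_depth, metric_depth, background_mask, min_depth=1e-6):
    """Return a robust scale alpha such that alpha * slam_depth ~ metric_depth.

    Illustrative stand-in only: uses the median of per-pixel ratios D_i / d_i
    over background pixels, not the optimization described in the paper.
    """
    ratios = [
        D / d
        for d, D, is_bg in zip(slam_depth, metric_depth, background_mask)
        # Skip human (non-background) pixels and invalid depths.
        if is_bg and d > min_depth and D > min_depth
    ]
    if not ratios:
        raise ValueError("no valid background pixels to estimate scale from")
    return median(ratios)


# Toy data: the true scale is 2.0; the fourth pixel is an outlier in the
# metric depth prediction, and the fifth pixel lies on the human (masked out).
slam_depth = [1.0, 2.0, 3.0, 4.0, 1.5]
metric_depth = [2.0, 4.0, 6.0, 40.0, 9.0]
background_mask = [True, True, True, True, False]

alpha = estimate_scale(slam_depth, metric_depth, background_mask)
print(alpha)
```

On this toy input the outlier ratio of 10 does not move the estimate, which is why a robust statistic is preferable to a plain least-squares fit here.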

Abstract

We propose TRAM, a two-stage method to reconstruct a human's global trajectory and motion from in-the-wild videos. TRAM robustifies SLAM to recover the camera motion in the presence of dynamic humans and uses the scene background to derive the motion scale. Using the recovered camera as a metric-scale reference frame, we introduce a video transformer model (VIMO) to regress the kinematic body motion of a human. By composing the two motions, we achieve accurate recovery of 3D humans in the world space, reducing global motion errors by a large margin from prior work. https://yufu-wang.github.io/tram4d/
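The composition of the two motions can be sketched as a rigid-transform product. The example below assumes a camera-to-world convention (rotation `R_wc`, up-to-scale SLAM translation `t_wc`), a metric scale `alpha` applied to the camera translation, and a human root pose (`R_ch`, `t_ch`) expressed in the camera frame. These names and the convention are assumptions for illustration, not the paper's notation.

```python
import math


def matmul3(A, B):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec3(A, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]


def compose_to_world(R_wc, t_wc, alpha, R_ch, t_ch):
    """Map a camera-frame human root pose into the world frame.

    Assumes (R_wc, t_wc) maps camera coordinates to world coordinates and that
    t_wc is only known up to scale, so it is multiplied by alpha.
    """
    R_wh = matmul3(R_wc, R_ch)
    rotated = matvec3(R_wc, t_ch)
    t_wh = [rotated[i] + alpha * t_wc[i] for i in range(3)]
    return R_wh, t_wh


def rot_y(theta):
    """Rotation by theta radians about the y axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


# Toy frame: the camera is yawed 90 degrees and has moved one SLAM unit along
# world x; with alpha = 2 that is 2 m. The human stands 3 m in front of the camera.
R_wc = rot_y(math.pi / 2)
t_wc = [1.0, 0.0, 0.0]
alpha = 2.0
R_ch = rot_y(0.0)
t_ch = [0.0, 0.0, 3.0]

R_wh, t_wh = compose_to_world(R_wc, t_wc, alpha, R_ch, t_ch)
print(t_wh)
```

Applying this per frame gives a world-space root trajectory. In this toy frame the human ends up about 5 m along world x, up to floating-point error.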

Paper Structure (13 sections, 4 equations, 6 figures, 5 tables)

Introduction
Related Work
Method
Preliminary: 3D Human Model
Masked DROID-SLAM
Trajectory Scale Estimation
Video Transformer for Human Motion
Experiments
Comparison on Camera Trajectory Recovery
Comparison on Human Trajectory Recovery
Comparison on Human Body Motion Reconstruction
Limitations
Conclusions

Figures (6)

Figure 1: Overview. Given an in-the-wild video, TRAM reconstructs the complete 3D human motion: global trajectory and local body motion, in diverse and long-range scenarios.
Figure 2: Overview of TRAM. Top-left: given a video, we first recover the relative camera motion and scene depth with DROID-SLAM, which we robustify with dual masking (Sec. 3.2). Top-right: we align the recovered depth to metric depth prediction with an optimization procedure to estimate metric scaling (Sec. 3.3). Bottom: we introduce VIMO to reconstruct the 3D human in the camera coordinate (Sec. 3.4), and use the metric-scale camera to convert the human trajectory and body motion to the global coordinate.
Figure 3: Video transformer VIMO builds on top of the large pre-trained HMR2.0 and adds two temporal transformers to propagate information across video frames. Right: the temporal transformers use the same encoder-only architecture. In the first temporal module, the input tokens are patch tokens at the same spatial location across time; in the second temporal module, they are SMPL poses across time. More details are included in the supplementary.
Figure 4: Camera trajectory estimation. With dynamic humans in the scene, the default DROID-SLAM tends to diverge. The two-step masking makes it robust. In addition, our procedure estimates a reasonable metric scale for the cameras.
Figure 5: Human global trajectory on EMDB. Compared to WHAM, our method produces less drift and a more accurate scale for complex and long-range tracks.
...and 1 more figure
