HumanMM: Global Human Motion Recovery from Multi-shot Videos

Yuhong Zhang, Guanlin Wu, Ling-Hao Chen, Zhuokai Zhao, Jing Lin, Xiaoke Jiang, Jiamin Wu, Zhuoheng Li, Hao Frank Yang, Haoqian Wang, Lei Zhang

TL;DR

HumanMM addresses the problem of recovering long-sequence 3D human motion in world coordinates from multi-shot monocular videos. It introduces a pipeline that (i) detects shot transitions, (ii) estimates per-shot camera poses with Masked LEAP-VO, (iii) aligns orientation and pose across shots via an orientation alignment module and a multi-shot HMR encoder, and (iv) post-processes motion with a trajectory predictor/refiner to reduce foot sliding. A new ms-Motion multi-shot benchmark demonstrates state-of-the-art performance across global motion and orientation metrics, supported by strong ablations showing the necessity of each component. The work enables robust, continuous world-space motion recovery in unconstrained videos and provides a public dataset for benchmarking multi-shot HMR methods.

Abstract

In this paper, we present a novel framework designed to reconstruct long-sequence 3D human motion in world coordinates from in-the-wild videos with multiple shot transitions. Such long-sequence in-the-wild motions are highly valuable to applications such as motion generation and motion understanding, but are very challenging to recover because of the abrupt shot transitions, partial occlusions, and dynamic backgrounds present in such videos. Existing methods primarily focus on single-shot videos, where continuity is maintained within a single camera view, or simplify multi-shot alignment to camera space only. In this work, we tackle these challenges by integrating enhanced camera pose estimation with Human Motion Recovery (HMR), incorporating a shot transition detector and a robust alignment module for accurate pose and orientation continuity across shots. By leveraging a custom motion integrator, we effectively mitigate foot sliding and ensure temporal consistency in human pose. Extensive evaluations on our multi-shot dataset, created from public 3D human datasets, demonstrate the robustness of our method in reconstructing realistic human motion in world coordinates.
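
The shot transition detector has to catch cuts where the background barely changes but the person jumps to a different image position (the bounding-box case in the Figure 3 caption). The following is a minimal sketch of one way such a check could work, not the authors' implementation: it flags a transition whenever the intersection-over-union (IoU) of the person's bounding boxes in consecutive frames falls below a threshold. The function names, the `(x1, y1, x2, y2)` box format, and the default threshold of 0.3 are all assumptions made for illustration.

```python
from typing import List, Sequence

Box = Sequence[float]  # assumed format: (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detect_transitions(boxes: List[Box], iou_threshold: float = 0.3) -> List[int]:
    """Return frame indices at which a new shot is assumed to start.

    A frame t is flagged when the tracked person's box overlaps too little
    with the box in frame t - 1. The threshold is a hypothetical value and
    would need tuning on real data.
    """
    return [
        t
        for t in range(1, len(boxes))
        if iou(boxes[t - 1], boxes[t]) < iou_threshold
    ]


# Per-frame person boxes; the subject jumps across the image at frame 2.
example_boxes = [
    (0, 0, 10, 10),
    (1, 0, 11, 10),
    (100, 100, 110, 110),
    (101, 100, 111, 110),
]
example_cuts = detect_transitions(example_boxes)
```

A check like this would miss cuts where the person stays in place but the viewpoint changes, which is why the Figure 3 caption also lists pose-tracking-based detection as a separate case.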

Paper Structure

This paper contains 29 sections, 34 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Comparison of the sequence-length distribution of existing large-scale markerless motion datasets with ours. The $x$-axis and $y$-axis denote the duration (s) and the percentage of videos, respectively. Our dataset (in green) contains a larger proportion of long-sequence videos.
  • Figure 2: Overview of HumanMM. HumanMM processes multi-shot video sequences by first extracting motion features, such as keypoints and bounding boxes, using ViTPose [xu2022vitpose], and image features using ViT [dosovitskiy2020vit]. These features are then segmented into single-shot clips via Shot Transition Detection. Initial camera parameters (camera rotation $\mathbf{R}$ and camera translation $\mathbf{T}$) and human (SMPL) parameters for each shot are estimated using Masked LEAP-VO and GVHMR [shen2024gvhmr]. Human orientation is aligned across shots through camera calibration, and ms-HMR ensures consistent pose alignment. Finally, a bi-directional LSTM-based trajectory predictor with a trajectory refiner predicts the trajectory from the aligned motion and mitigates foot sliding throughout the video.
  • Figure 3: Shot transition detection examples. Examples (a), (b), and (c) illustrate multi-shot scenarios in online videos. (a) shows scene transitions detectable by SceneDetect. (b) illustrates significant position changes that SceneDetect cannot detect but a bbox tracking-based method can. (c) shows a pose or orientation transition, which requires pose tracking-based methods because neither SceneDetect nor bbox tracking can handle it.
  • Figure 4: Human orientation alignment module. Following a shot transition after the foremost purple human mesh (shot ①, captured by camera $C_0$), the unaligned (blue) and aligned (green) motions are captured as shot ② and shot ③ by cameras $C_0'$ and $C_1$, respectively, where $C_0' = C_0$. To align human orientation from shot ① to shot ③, the camera rotation matrix from $C_0'$ to $C_1$ is computed and applied as an offset to the human orientation.
  • Figure 5: ms-HMR Structure. The initial human pose parameters $\theta$ across multiple video shots are input into a transformer with shot-index-based positional encoding. This enables ms-HMR to generate consistent human poses across all shots in the video.
  • ...and 5 more figures
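
The orientation alignment described in the Figure 4 caption can be illustrated with plain rotation matrices. The sketch below is an illustration under stated assumptions rather than the paper's code: it assumes both camera rotations are already known as camera-to-world matrices in a common frame (how HumanMM obtains them through camera calibration is not shown here), and the helper names `rot_y`, `relative_rotation`, and `align_orientation` are hypothetical.

```python
import math
from typing import List

Mat3 = List[List[float]]


def matmul(a: Mat3, b: Mat3) -> Mat3:
    """3x3 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(a: Mat3) -> Mat3:
    """3x3 transpose (the inverse of a rotation matrix)."""
    return [[a[j][i] for j in range(3)] for i in range(3)]


def rot_y(deg: float) -> Mat3:
    """Rotation about the vertical (y) axis by `deg` degrees."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def relative_rotation(R_c0: Mat3, R_c1: Mat3) -> Mat3:
    """Rotation that re-expresses camera-C1 coordinates in camera C0's frame.

    Assumes R_c0 and R_c1 are camera-to-world rotations (an assumed convention).
    """
    return matmul(transpose(R_c0), R_c1)


def align_orientation(R_human_c1: Mat3, R_c0: Mat3, R_c1: Mat3) -> Mat3:
    """Apply the camera offset so the new shot's orientation is expressed
    in the first camera's frame and stays continuous across the cut."""
    return matmul(relative_rotation(R_c0, R_c1), R_human_c1)


# Toy check: a person facing 30 degrees in the world, seen by two cameras.
R_world_human = rot_y(30.0)
R_c0, R_c1 = rot_y(20.0), rot_y(90.0)
R_human_c1 = matmul(transpose(R_c1), R_world_human)  # orientation as seen by C1
aligned = align_orientation(R_human_c1, R_c0, R_c1)  # same person, in C0's frame
```

In this toy setup the aligned orientation equals the orientation that camera $C_0$ would have observed directly, which is the continuity property the alignment module is meant to provide.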