RopeTP: Global Human Motion Recovery via Integrating Robust Pose Estimation with Diffusion Trajectory Prior

Mingjiang Liang, Yongkang Cheng, Hualin Liang, Shaoli Huang, Wei Liu

TL;DR

RopeTP is a novel framework that combines Robust pose estimation with a diffusion Trajectory Prior to reconstruct global human motion from videos, and outperforms methods that rely on SLAM for initial camera estimates and extensive optimization, delivering more accurate and realistic trajectories.

Abstract

We present RopeTP, a novel framework that combines Robust pose estimation with a diffusion Trajectory Prior to reconstruct global human motion from videos. At the heart of RopeTP is a hierarchical attention mechanism that significantly improves context awareness, which is essential for accurately inferring the posture of occluded body parts. This is achieved by exploiting the relationships with visible anatomical structures, enhancing the accuracy of local pose estimations. The improved robustness of these local estimations allows for the reconstruction of precise and stable global trajectories. Additionally, RopeTP incorporates a diffusion trajectory model that predicts realistic human motion from local pose sequences. This model ensures that the generated trajectories are not only consistent with observed local actions but also unfold naturally over time, thereby improving the realism and stability of 3D human motion reconstruction. Extensive experimental validation shows that RopeTP surpasses current methods on two benchmark datasets, particularly excelling in scenarios with occlusions. It also outperforms methods that rely on SLAM for initial camera estimates and extensive optimization, delivering more accurate and realistic trajectories.

Paper Structure

This paper contains 13 sections, 10 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The upper part of the figure compares occlusion handling in human shape recovery between PARE and our proposed method. Panel (a) shows the original image with two differently shaped occluders. Panels (b) and (c) show the estimates from PARE and from our method under the two occlusion scenarios. The occlusion sensitivity map on the far right shows the maximum joint estimation error for each pixel position of the occluder, highlighting the robustness of the proposed method to occluders of various shapes. Unlike PARE, which does not recover global trajectory information, our method regenerates reasonable global trajectories while reconstructing robust poses. As illustrated in the lower half of the figure, the robust poses provide a strong prior for trajectory generation, resulting in human motion trajectories that are nearly consistent with the video.
  • Figure 2: The overall structure of RopeTP. Rope reconstructs the human pose in the video frame by frame, while TrajDenoiser regenerates the global trajectory of the sequence.
  • Figure 3: Overview of our Rope module. Given an RGB image as input, our method uses the Hierarchical Attention Guided Tokenizer to obtain multi-scale part and body features, combined into token sequences. The Adaptive Contextual Part Regressor optimizes scale-specific visual cues using a two-layer self-attention mechanism. Finally, we propose an efficient inter-hierarchical cross-attention layer to interact with visible information across scales and regress the SMPL parameters.
  • Figure 4: The visualization of body hierarchy levels.
  • Figure 5: Qualitative comparison on 3DPW and in-the-wild datasets. (a) Input images. (b) Results of PARE [kocabas2021pare]. (c) Results of CLIFF [li2022cliff]. (d) Our results.
  • ...and 1 more figure
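
The occlusion sensitivity map described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the paper's evaluation code: it assumes a square occluder of constant colour, a Euclidean per-joint error against ground-truth 3D joints, and a hypothetical callable `estimate_joints` that stands in for any pose estimator (PARE, Rope, or otherwise). The patch size, stride, and fill value are illustrative choices.

```python
import numpy as np


def occlusion_sensitivity_map(image, estimate_joints, gt_joints,
                              patch=40, stride=40, fill=0.5):
    """Slide a square occluder over `image` and record, for each occluder
    position, the maximum per-joint error of the estimator.

    image           : (H, W, C) float array
    estimate_joints : hypothetical callable, image -> (J, 3) joint positions
    gt_joints       : (J, 3) ground-truth joint positions
    Returns a 2D array with one entry per occluder position.
    """
    h, w = image.shape[:2]
    ys = range(0, h - patch + 1, stride)
    xs = range(0, w - patch + 1, stride)
    heat = np.zeros((len(ys), len(xs)), dtype=np.float64)

    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            # Paint the occluder onto a copy so the input stays untouched.
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill

            # Euclidean error per joint, then the worst joint at this position.
            pred = estimate_joints(occluded)
            per_joint_err = np.linalg.norm(pred - gt_joints, axis=-1)
            heat[i, j] = per_joint_err.max()

    return heat
```

Under these assumptions, a robust estimator yields a map with uniformly low values, whereas an occlusion-sensitive one shows high values wherever the occluder covers an informative body region.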