RemoCap: Disentangled Representation Learning for Motion Capture

Hongsheng Wang, Lizao Zhang, Zhangnan Zhong, Shuolin Xu, Xinrui Zhou, Shengyu Zhang, Huahao Xu, Fei Wu, Feng Lin

TL;DR

RemoCap tackles occlusion in 3D human mesh reconstruction by disentangling spatial and motion features through two novel modules: Spatial Disentanglement (SD) and Motion Disentanglement (MD). SD isolates target-specific spatial cues within occluded scenes, while MD decouples motion features over time using sequence shuffling and temporal attention, aided by a sequence velocity loss to enforce temporal coherence. The method is trained with a combination of vertex, 3D-joint, and 2D-joint losses to optimize both geometry and motion consistency. Empirical results on 3DPW show state-of-the-art performance across intra-frame metrics MPVPE, MPJPE, and PA-MPJPE, highlighting strong occlusion handling and temporal stability, with competitive results on Human3.6M and clear qualitative improvements in challenging scenes. Overall, RemoCap introduces a practical, transformer-backed, model-free approach that advances robust 3D human mesh reconstruction in real-world, occluded environments.
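
The intra-frame metrics named above can be illustrated with a small sketch. It assumes the common definition of MPJPE as the mean Euclidean distance between predicted and ground-truth joints; the function name `mpjpe` and the nested-list layout `[frames][joints][3]` are illustrative choices, not taken from the paper. MPVPE follows the same formula applied to mesh vertices instead of joints, and PA-MPJPE applies it after a rigid Procrustes alignment, which is omitted here.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error over a sequence.

    pred, gt: nested lists shaped [frames][joints][3], in the same units.
    Returns the average Euclidean distance over all frames and joints.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of frames")
    total, count = 0.0, 0
    for pred_frame, gt_frame in zip(pred, gt):
        if len(pred_frame) != len(gt_frame):
            raise ValueError("frames must have the same number of joints")
        for p, g in zip(pred_frame, gt_frame):
            total += math.dist(p, g)  # Euclidean distance for one joint
            count += 1
    if count == 0:
        raise ValueError("no joints to evaluate")
    return total / count


# One frame, two joints: the first is off by (3, 4, 0), the second is exact.
gt = [[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]]
pred = [[[3.0, 4.0, 0.0], [1.0, 1.0, 1.0]]]
print(mpjpe(pred, gt))  # 2.5
```

Passing vertex coordinates instead of joint coordinates to the same function gives an MPVPE-style number under the same assumption.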

Abstract

Reconstructing 3D human bodies from realistic motion sequences remains a challenge due to pervasive and complex occlusions. Current methods struggle to capture the dynamics of occluded body parts, leading to model penetration and distorted motion. RemoCap leverages Spatial Disentanglement (SD) and Motion Disentanglement (MD) to overcome these limitations. SD addresses occlusion interference between the target human body and surrounding objects. It achieves this by disentangling target features along the dimension axis. By aligning features based on their spatial positions in each dimension, SD isolates the target object's response within a global window, enabling accurate capture despite occlusions. The MD module employs a channel-wise temporal shuffling strategy to simulate diverse scene dynamics. This process effectively disentangles motion features, allowing RemoCap to reconstruct occluded parts with greater fidelity. Furthermore, this paper introduces a sequence velocity loss that promotes temporal coherence. This loss constrains inter-frame velocity errors, ensuring the predicted motion exhibits realistic consistency. Extensive comparisons with state-of-the-art (SOTA) methods on benchmark datasets demonstrate RemoCap's superior performance in 3D human body reconstruction. On the 3DPW dataset, RemoCap surpasses all competitors, achieving the best results in MPVPE (81.9), MPJPE (72.7), and PA-MPJPE (44.1) metrics. Code is available at https://wanghongsheng01.github.io/RemoCap/.
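
The abstract describes the sequence velocity loss only as a constraint on inter-frame velocity errors, so the following is a minimal sketch under stated assumptions rather than the paper's formulation: velocity is taken as the finite difference between consecutive frames, and the loss is the mean Euclidean distance between predicted and ground-truth velocities. The names `velocities` and `sequence_velocity_loss` are hypothetical.

```python
import math


def velocities(seq):
    """Finite-difference velocity seq[t+1] - seq[t] for every joint.

    seq: nested list shaped [frames][joints][3].
    """
    return [
        [[b - a for a, b in zip(j0, j1)] for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(seq, seq[1:])
    ]


def sequence_velocity_loss(pred, gt):
    """Mean distance between predicted and ground-truth velocities.

    Assumed form for illustration; the paper's exact loss may differ.
    """
    if len(pred) != len(gt) or len(pred) < 2:
        raise ValueError("need two or more frames and equal sequence lengths")
    errors = [
        math.dist(pv, gv)
        for pf, gf in zip(velocities(pred), velocities(gt))
        for pv, gv in zip(pf, gf)
    ]
    return sum(errors) / len(errors)


# Ground truth: one joint standing still for three frames.
gt = [[[0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0]]]
# A constant positional offset has the same velocity as the ground truth.
offset = [[[1.0, 1.0, 1.0]], [[1.0, 1.0, 1.0]], [[1.0, 1.0, 1.0]]]
# A single-frame jump is penalized on both neighboring transitions.
jitter = [[[0.0, 0.0, 0.0]], [[3.0, 4.0, 0.0]], [[0.0, 0.0, 0.0]]]
print(sequence_velocity_loss(offset, gt))  # 0.0
print(sequence_velocity_loss(jitter, gt))  # 5.0
```

The two cases show why such a term complements per-frame position losses: it ignores a static offset, which position losses already penalize, and reacts only to frame-to-frame inconsistency.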

Paper Structure (23 sections, 4 equations, 11 figures, 4 tables)

Figures (11)

  • Figure 1: (a) Image input with occlusion, (b) Distortion in the reconstruction of occluded parts by GLoT [shen2023global], (c) Optical flow information for the scene.
  • Figure 2: Pipeline of the RemoCap model. Feature maps are first disentangled by the Spatial Disentanglement (SD) and Temporal Disentanglement (TD) modules. The disentangled features are then reweighted using a sigmoid function before being decoded by a Transformer encoder to generate the final sequence of 3D human mesh vertices.
  • Figure 3: Feature Disentanglement Module Internal Details.
  • Figure 4: Qualitative comparison of different methods on complex scenes such as a person riding a bicycle. The original video sequence is shown on the far left for reference. To its right are the outputs of three methods: ours, FastMETRO [cho2022cross], and GLoT [shen2023global].
  • Figure 5: We visually compare the features extracted by the original CNN with the heatmaps before and after using the SD and the TD modules.
  • ...and 6 more figures