Semantics-aware Test-time Adaptation for 3D Human Pose Estimation

Qiuxia Lin, Rongyu Chen, Kerui Gu, Angela Yao

TL;DR

This work tackles semantic misalignment in test-time adaptation (TTA) for 3D human pose estimation by introducing a semantics-aware motion prior that aligns predicted motion with video action semantics via MotionCLIP, complemented by a 2D pose completion mechanism guided by motion-text similarity. It couples a motion-text alignment loss with an EMA update of the 2D poses and a smoothing term to produce semantically coherent pose sequences during inference, even under occlusion. Empirical results on 3DPW, 3DHP, and EgoBody show substantial improvements over prior TTA methods, including PA-MPJPE reductions of more than 12% on 3DPW and 3DHP, while maintaining efficiency. Limitations include reliance on motion-language models and potential VLM mislabeling, motivating future work on more robust semantic representations and flexible motion dictionaries.

Abstract

This work highlights a semantics misalignment in 3D human pose estimation. For the task of test-time adaptation, the misalignment manifests as overly smoothed and unguided predictions. The smoothing settles predictions towards some average pose. Furthermore, when there are occlusions or truncations, the adaptation becomes fully unguided. To this end, we pioneer the integration of a semantics-aware motion prior for the test-time adaptation of 3D pose estimation. We leverage video understanding and a well-structured motion-text space to adapt the model motion prediction to adhere to video semantics during test time. Additionally, we incorporate a missing 2D pose completion based on the motion-text similarity. The pose completion strengthens the motion prior's guidance for occlusions and truncations. Our method significantly improves state-of-the-art 3D human pose estimation TTA techniques, with more than 12% decrease in PA-MPJPE on 3DPW and 3DHP.
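As a rough illustration of how a motion-text alignment term and a smoothing term could be computed, here is a minimal sketch in plain Python. It assumes that motion and text have already been embedded into a shared space (for example by a MotionCLIP-style encoder, which is not reproduced here) and that alignment is measured with cosine similarity. The function names, the `1 - cosine` form of the loss, and the squared-difference smoothing are illustrative assumptions, not the authors' exact formulation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_loss(motion_emb, text_emb):
    """Assumed form: 1 - cos(motion, text); small when motion matches the text label."""
    return 1.0 - cosine_similarity(motion_emb, text_emb)


def smoothness_loss(poses):
    """Assumed form: mean squared frame-to-frame difference over a pose sequence.

    `poses` is a list of frames, each a flat list of pose parameters.
    """
    if len(poses) < 2:
        return 0.0
    total, count = 0.0, 0
    for prev, cur in zip(poses, poses[1:]):
        for p, c in zip(prev, cur):
            total += (c - p) ** 2
            count += 1
    return total / count


# Hypothetical embeddings and a two-frame pose sequence.
print(alignment_loss([1.0, 0.0], [1.0, 0.0]))      # identical directions
print(alignment_loss([1.0, 0.0], [0.0, 1.0]))      # orthogonal directions
print(smoothness_loss([[0.0, 0.0], [1.0, 1.0]]))   # one frame transition
```

A smoothing term on its own is minimized by a constant sequence, which matches the abstract's observation that smoothing settles predictions toward an average pose; the alignment term is what supplies action-specific guidance.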

Paper Structure

This paper contains 22 sections, 10 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: We provide two examples to illustrate the issues in TTA methods. Compared to CycleAdapt, our method enables accurate bending of the legs in the "climbing-the-stairs" scenario and "walking" in the occluded scenario, demonstrating the effectiveness of incorporating semantics-aware motion information for both visible and occluded keypoints.
  • Figure 2: Overall framework: we perform model fine-tuning and a 2D pose update in each epoch. Model Fine-tuning (Left): The extracted video segment and its text label are used to update $f^{\text{HPE}}$ with 2D projection, alignment, and smoothness losses. 2D Pose Update (Right): After model fine-tuning in each epoch, all images in the current test video are processed through $f^{\text{HPE}}$ to obtain predicted 2D keypoints, which are used to update the existing 2D keypoints via EMA. Subsequently, all video segments are checked, and missing 2D keypoints are filled in when the motion-text alignment conditions are satisfied.
  • Figure 3: The improvement distribution on the 3DPW dataset with the HMR network. We only show the top 7 semantics for brevity. The improved motions include both common and rare semantics found in action datasets.
  • Figure 4: The relationship between the fill-in threshold, number of keypoints, and 2D keypoint PCK. Increasing the threshold leads to a decrease in the ratio of filled-in keypoints, but an improvement in their quality, as indicated by a higher PCK.
  • Figure 5: Qualitative comparison on one video sequence from the 3DPW dataset with the text label "Walking". From epoch 2 to epoch 4, results show that motion-text alignment alone is insufficient to preserve motion semantics with missing 2D data. We use green and red squares to highlight the transition from walking to standing poses. Incorporating 2D pose updates (Ours) helps mitigate this problem. Compared to CycleAdapt, which shows a static standing foot under image truncation, ours captures the natural alternation of foot and knee bending during walking, aligning more closely with the actual walking sequences.
  • ...and 6 more figures
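
The 2D pose update described in the Figure 2 caption and the fill-in threshold discussed in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions: keypoints are `(x, y)` tuples with `None` marking a missing detection, the segment-level motion-text similarity is given as a single number, and fill-in happens only when that similarity reaches a threshold. The names `ema_update` and `fill_missing`, the momentum of 0.9, and the threshold of 0.8 are hypothetical; the actual conditions used by the method are not spelled out in this excerpt.

```python
def ema_update(current, predicted, momentum=0.9):
    """Blend each available 2D keypoint with the model's prediction.

    Missing keypoints (None) are left missing; they are handled by fill_missing.
    """
    updated = []
    for cur, pred in zip(current, predicted):
        if cur is None:
            updated.append(None)
        else:
            updated.append(tuple(momentum * c + (1.0 - momentum) * p
                                 for c, p in zip(cur, pred)))
    return updated


def fill_missing(current, predicted, similarity, threshold=0.8):
    """Fill missing keypoints with predictions only if the segment's
    motion-text similarity reaches the threshold (assumed condition)."""
    if similarity < threshold:
        return list(current)
    return [pred if cur is None else cur
            for cur, pred in zip(current, predicted)]


# Hypothetical segment with one visible and one missing keypoint.
detected = [(0.0, 0.0), None]
predicted = [(2.0, 4.0), (5.0, 6.0)]

blended = ema_update(detected, predicted, momentum=0.5)
print(blended)
print(fill_missing(blended, predicted, similarity=0.9))   # similarity above threshold
print(fill_missing(blended, predicted, similarity=0.3))   # similarity below threshold
```

Raising `threshold` in this sketch fills in fewer keypoints but only from segments whose motion agrees more strongly with the text label, which mirrors the trade-off between fill-in ratio and PCK reported for Figure 4.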