Online Test-time Adaptation for 3D Human Pose Estimation: A Practical Perspective with Estimated 2D Poses

Qiuxia Lin, Kerui Gu, Linlin Yang, Angela Yao

TL;DR

This paper tackles online test-time adaptation for 3D human pose estimation in streaming video using estimated 2D poses as supervision, a setting where naive updates can catastrophically propagate errors. It proposes a threefold framework—adaptive aggregation, two-stage optimization, and local augmentation—to limit the impact of noisy 2D estimates while still leveraging informative historical data and current frame cues. Through memory-based representative sampling, confidence-aware pseudo-labeling, per-sample staged updates, and augmentation from nearby confident frames, the method achieves large improvements over state-of-the-art on challenging datasets (e.g., 3DPW and 3DHP) and demonstrates robustness across different 2D estimators. The work advances practical OTTA for 3D pose estimation by enabling reliable adaptation when only estimated 2D poses are available, a common real-world scenario.

Abstract

Online test-time adaptation for 3D human pose estimation is used for video streams that differ from training data. Existing methods use ground truth 2D poses for adaptation, but only estimated 2D poses are available in practice. This paper addresses adapting models to streaming videos with estimated 2D poses. Comparing adaptations reveals the central challenge: limiting the effect of estimation errors while preserving accurate pose information. To this end, we propose adaptive aggregation, a two-stage optimization, and local augmentation for handling varying levels of estimated pose error. First, we perform adaptive aggregation across videos to initialize the model state with labeled representative samples. Within each video, we use a two-stage optimization to benefit from 2D fitting while minimizing the impact of erroneous updates. Second, we employ local augmentation, using adjacent confident samples to update the model before adapting to the current non-confident sample. Our method surpasses state-of-the-art by a large margin, advancing adaptation towards more practical settings of using estimated 2D poses.
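The abstract separates confident from non-confident samples and uses adjacent confident samples before adapting to a non-confident one. The sketch below illustrates one possible way to make that split and look up nearby confident frames. It is not the authors' implementation: the mean-score rule, the `threshold` value, the restriction to past frames, and both function names are assumptions made for illustration.

```python
import numpy as np

def split_by_confidence(keypoint_conf, threshold=0.5):
    """Return a boolean mask marking confident frames.

    keypoint_conf: array of shape (T, J) with per-joint 2D detector scores.
    Assumption: a frame counts as confident when its mean joint score
    reaches `threshold`; the paper may use a different criterion.
    """
    keypoint_conf = np.asarray(keypoint_conf, dtype=float)
    return keypoint_conf.mean(axis=1) >= threshold

def nearest_confident_frames(confident_mask, t, k=2):
    """Return up to k confident frame indices closest to frame t.

    Assumption: only frames before t are considered, since future frames
    are not yet available in an online stream.
    """
    past = np.flatnonzero(np.asarray(confident_mask)[:t])
    order = np.argsort(t - past, kind="stable")  # smallest temporal gap first
    return past[order[:k]].tolist()

# Hypothetical scores for 4 frames with 2 joints each.
scores = [[0.9, 0.8], [0.2, 0.3], [0.7, 0.9], [0.1, 0.2]]
mask = split_by_confidence(scores)
neighbours = nearest_confident_frames(mask, t=3, k=2)
```

In this toy stream, frames 0 and 2 are confident, so the non-confident frame 3 would draw its augmentation sources from frame 2 first and frame 0 second.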

Paper Structure

This paper contains 16 sections, 5 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: (a) Adaptive Aggregation: Representative samples from both confident and non-confident data are selected using spherical K-means, with memory banks storing these samples for limited retrieval during video transitions. (b) Two-stage Optimization: Each sample undergoes two-stage adaptation to maintain 2D fitting and minimize incorrect updates.
  • Figure 2: Local augmentation. We augment the temporally adjacent confident samples to simulate the hard features in non-confident samples. The transformed predictions are used to guide the model before adapting to the non-confident sample. Specifically, we first adapt the model using augmented confident samples to obtain Model'. Then, Model' is further adapted with non-confident samples to obtain the final Model*, which is used for prediction.
  • Figure 3: The two-stage optimization benefits from 2D projection while minimizing introduced errors.
  • Figure 4: Qualitative results on a 3DPW test sample comparing our method with EFT, DynaBOA, and CycleAdapt*. We present the estimated 2D results (a) with confident keypoints colored green and non-confident keypoints colored orange. Our method shows better results in challenging cases involving truncation and complex backgrounds. *Modified CycleAdapt source code to accommodate the online setting.
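
The Figure 1(a) caption mentions selecting representative samples with spherical K-means. As a rough illustration of that general technique (not the authors' code), the sketch below clusters unit-normalized feature vectors by cosine similarity and returns, for each cluster, the index of the sample closest to its center. The function name, the feature representation, the iteration count, and the random initialization are all assumptions; features are assumed to be nonzero vectors.

```python
import numpy as np

def spherical_kmeans_representatives(features, k, n_iter=20, seed=0):
    """Pick one representative sample index per cluster.

    features: array of shape (N, D) with nonzero rows (assumed).
    Returns a sorted list of sample indices, at most k long.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(features, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # project onto unit sphere
    centers = x[rng.choice(len(x), size=k, replace=False)]

    for _ in range(n_iter):
        assign = (x @ centers.T).argmax(axis=1)  # highest cosine similarity
        for c in range(k):
            total = x[assign == c].sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:  # keep the old center if the cluster is empty
                centers[c] = total / norm

    sim = x @ centers.T
    assign = sim.argmax(axis=1)
    reps = []
    for c in range(k):
        members = np.flatnonzero(assign == c)
        if len(members) > 0:
            # Representative = member most similar to the cluster center.
            reps.append(int(members[sim[members, c].argmax()]))
    return sorted(reps)

# Hypothetical 2D features forming two directional groups.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
reps = spherical_kmeans_representatives(feats, k=2)
```

Storing only these representatives is one way a memory bank could stay small while still covering the directions present in the feature space.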