STRIDE: Single-video based Temporally Continuous Occlusion-Robust 3D Pose Estimation

Rohit Lal, Saketh Bachu, Yash Garg, Arindam Dutta, Calvin-Khang Ta, Dripta S. Raychaudhuri, Hannah Dela Cruz, M. Salman Asif, Amit K. Roy-Chowdhury

TL;DR

The paper tackles robust 3D human pose estimation under heavy occlusion by introducing STRIDE, a test-time training framework that learns a video-specific motion prior. STRIDE pre-trains a DSTFormer-based, dual-stream spatio-temporal transformer using masked sequence denoising on 3D pose data, then adaptively fine-tunes this prior for each test video with self-supervised losses to enforce temporal continuity and anatomical plausibility. The approach is model-agnostic, improving poses produced by any off-the-shelf estimator, and demonstrates state-of-the-art performance on Occluded Human3.6M and OCMotion, including scenarios with complete occlusion, while achieving significant speed-ups. This reduces reliance on large labeled occluded datasets and offers practical benefits for real-time applications in AR/VR and action recognition, though it currently focuses on single-person occlusions.
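To make the pre-training idea concrete, the following is a minimal sketch of how a clean pose sequence might be corrupted for masked sequence denoising. The corruption scheme (zeroing whole frames and adding Gaussian jitter to the rest), the function names, and all parameter values are assumptions made for illustration; they are not taken from the paper or its released code.

```python
import math
import random


def mask_pose_sequence(poses, frame_mask_prob=0.3, joint_noise_std=0.02, seed=0):
    """Corrupt a clean pose sequence (hypothetical scheme, for illustration).

    poses: list of frames, each frame a list of (x, y, z) joint tuples.
    Returns (corrupted, frame_masked); frame_masked[t] is True when frame t
    was zeroed out to imitate a fully occluded frame.
    """
    rng = random.Random(seed)
    corrupted, frame_masked = [], []
    for frame in poses:
        if rng.random() < frame_mask_prob:
            # Simulate complete occlusion: every joint in the frame is dropped.
            corrupted.append([(0.0, 0.0, 0.0) for _ in frame])
            frame_masked.append(True)
        else:
            # Simulate estimator noise: jitter every coordinate slightly.
            corrupted.append([
                tuple(c + rng.gauss(0.0, joint_noise_std) for c in joint)
                for joint in frame
            ])
            frame_masked.append(False)
    return corrupted, frame_masked


def reconstruction_error(pred, target):
    """Mean per-joint Euclidean distance between two pose sequences."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)
            count += 1
    return total / count if count else 0.0
```

In such a setup, a denoising model would receive `corrupted` as input and be trained to minimize `reconstruction_error` against the clean sequence, so that it learns to fill in masked frames from their temporal neighbours.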

Abstract

The capability to accurately estimate 3D human poses is crucial for diverse fields such as action recognition, gait recognition, and virtual/augmented reality. However, a persistent and significant challenge within this field is the accurate prediction of human poses under conditions of severe occlusion. Traditional image-based estimators struggle with heavy occlusions due to a lack of temporal context, resulting in inconsistent predictions. While video-based models benefit from processing temporal data, they encounter limitations when faced with prolonged occlusions that extend over multiple frames. This challenge arises because these models struggle to generalize beyond their training datasets, and the variety of occlusions is hard to capture in the training data. Addressing these challenges, we propose STRIDE (Single-video based TempoRally contInuous Occlusion-Robust 3D Pose Estimation), a novel Test-Time Training (TTT) approach to fit a human motion prior for each video. This approach specifically handles occlusions that were not encountered during the model's training. By employing STRIDE, we can refine a sequence of noisy initial pose estimates into accurate, temporally coherent poses during test time, effectively overcoming the limitations of prior methods. Our framework demonstrates flexibility by being model-agnostic, allowing us to use any off-the-shelf 3D pose estimation method for improving robustness and temporal consistency. We validate STRIDE's efficacy through comprehensive experiments on challenging datasets like Occluded Human3.6M, Human3.6M, and OCMotion, where it not only outperforms existing single-image and video-based pose estimation models but also showcases superior handling of substantial occlusions, achieving fast, robust, accurate, and temporally consistent 3D pose estimates. Code is made publicly available at https://github.com/take2rohit/stride

Paper Structure (12 sections, 6 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Effect of occlusions on pose estimation. Image-based 3D pose estimators [black2023bedlam] often struggle with heavy occlusions, as illustrated in this figure. Without temporal context, predictions on highly obscured frames are inconsistent with prior poses, like the erroneous pose in the third column. Notably, even state-of-the-art video approaches [shin2023wham] fail on prolonged full occlusions spanning multiple frames, as in columns 4 and 5. This highlights another critical limitation: models are brittle when deployed outside their training distributions. Without training examples of such long-duration occlusions, models fail to extrapolate reasonable poses. Our work addresses this through test-time training of a human motion prior. By fine-tuning on each new video, we tailor this parametric prior to handle sequence-specific occlusion patterns not observed during training. Given an initial noisy estimate, our approach refines the pose sequence into an accurate, temporally coherent output, as shown in the final row.
  • Figure 2: Overview of our approach. Our method enhances 3D pose estimation for occluded videos through test-time training of a motion prior model. We first extract initial 3D pose estimates from the test video using any off-the-shelf 3D pose estimator. To address occlusions and test distribution shifts, we then fine-tune the motion prior on that specific video by optimizing for smooth and continuous poses over the sequence.
  • Figure 3: Pipeline of STRIDE, our temporally continuous pose estimation method. We first pre-train a motion prior model, denoted $\mathcal{M}$, on a diverse set of 3D pose data sourced from various public datasets. The objective of this motion prior is to generate a temporally continuous sequence of poses when given a sequence of noisy poses. In the single-video training stage, we obtain a sequence of noisy poses using a 3D pose estimation model, $\mathcal{P}$, whose weights are held constant during this phase. We then pass this noisy pose sequence through the motion prior model $\mathcal{M}$ and retrain it using the self-supervised losses outlined in Equation \ref{eq:final_loss}. The result of this training process is a model that produces temporally continuous 3D poses for that specific video.
  • Figure 4: 3D pose estimation results on OCMotion (0013, Camera01). This figure demonstrates how our method incorporates temporal continuity into video sequences under occlusion. The second row shows 3D poses predicted by CycleAdapt [cycleadapt]. The third row shows 3D poses predicted by STRIDE. Note: the 3D poses shown in translucent red in the second and third rows are the ground truth.