Robust Long-term Test-Time Adaptation for 3D Human Pose Estimation through Motion Discretization

Yilin Wen, Kechuan Dong, Yusuke Sugano

TL;DR

This paper addresses error accumulation in online test-time adaptation for 3D human pose estimation from streaming video. It proposes motion discretization by unsupervised clustering in the latent motion space to generate anchor motions that regularize the pose estimator, and introduces a self-replay mechanism for the motion denoising network, plus a soft-reset EMA to stabilize updates of the pose estimator. The method couples a pose estimator $\mathrm{F}$ with a motion denoising network $\mathrm{M}$, updating them cyclically, and uses an anchor loss $L_{ach}$ together with standard losses to guide adaptation. Experiments on Ego-Exo4D and 3DPW show improved robustness and accuracy over prior online TTA methods, validating the effectiveness of motion discretization and continuous personalized adaptation. Overall, the approach offers a practical pathway to exploit individual shape and habitual motion traits during long-term online adaptation.
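
The motion discretization idea can be illustrated with a small sketch. The summary does not name the clustering algorithm or the exact form of $L_{ach}$, so everything below is an assumption for illustration: plain k-means stands in for the unsupervised clustering, latent motions are short lists of floats, and the anchor loss is taken to be the squared distance to the nearest cluster center. The names `kmeans`, `retrieve_anchor`, and `anchor_loss` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two latent vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Vector], k: int, iters: int = 20) -> List[Vector]:
    """Plain Lloyd's k-means (an assumed stand-in for the paper's clustering).

    Centers are initialised from evenly spaced input points so the
    sketch is deterministic; this is a simplification.
    """
    n = len(points)
    centers = [list(points[(i * n) // k]) for i in range(k)]
    for _ in range(iters):
        buckets: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            buckets[nearest].append(p)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centers


def retrieve_anchor(z: Vector, anchors: Sequence[Vector]) -> Vector:
    """Return the anchor (cluster center) closest to latent motion z."""
    return min(anchors, key=lambda a: sq_dist(z, a))


def anchor_loss(z: Vector, anchors: Sequence[Vector]) -> float:
    """Assumed anchor loss: squared distance from z to its nearest anchor."""
    return sq_dist(z, retrieve_anchor(z, anchors))


if __name__ == "__main__":
    # Toy 2D "latent motions" forming two well-separated groups.
    latents = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
               [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
    anchors = kmeans(latents, k=2)
    print(anchors)
    print(anchor_loss([0.2, 0.1], anchors))
```

In this toy setting, a noisy latent motion is pulled toward the nearest anchor, which is the regularizing role the summary attributes to anchor motions; the actual method operates on learned motion representations rather than raw 2D points.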

Abstract

Online test-time adaptation addresses the train-test domain gap by adapting the model on unlabeled streaming test inputs before making the final prediction. However, online adaptation for 3D human pose estimation suffers from error accumulation when relying on self-supervision with imperfect predictions, leading to degraded performance over time. To mitigate this fundamental challenge, we propose a novel solution that highlights the use of motion discretization. Specifically, we employ unsupervised clustering in the latent motion representation space to derive a set of anchor motions, whose regularity aids in supervising the human pose estimator and enables efficient self-replay. Additionally, we introduce an effective and efficient soft-reset mechanism by reverting the pose estimator to its exponential moving average during continuous adaptation. We examine long-term online adaptation by continuously adapting to out-of-domain streaming test videos of the same individual, which allows for the capture of consistent personal shape and motion traits throughout the streaming observation. By mitigating error accumulation, our solution enables robust exploitation of these personal traits for enhanced accuracy. Experiments demonstrate that our solution outperforms previous online test-time adaptation methods and validate our design choices.
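
The soft-reset mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: model weights are a dictionary of floats, adaptation is a plain gradient step, and the reset fires every `reset_every` steps. The abstract does not say when the reset is triggered, so that schedule, the decay value, and the names `ema_update` and `adapt_stream` are all hypothetical.

```python
from typing import Dict, Iterable, Tuple

Params = Dict[str, float]


def ema_update(ema: Params, params: Params, decay: float) -> Params:
    """Blend the current weights into the exponential moving average."""
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in ema}


def adapt_stream(
    params: Params,
    grads_per_step: Iterable[Params],
    lr: float = 0.1,
    decay: float = 0.9,
    reset_every: int = 3,  # assumed schedule, not taken from the paper
) -> Tuple[Params, Params]:
    """Adapt weights online and periodically revert them to their EMA."""
    ema = dict(params)
    for step, grads in enumerate(grads_per_step, start=1):
        # Self-supervised update (here: a plain gradient step).
        params = {k: params[k] - lr * grads[k] for k in params}
        ema = ema_update(ema, params, decay)
        if step % reset_every == 0:
            # Soft reset: fall back to the smoothed weights rather than
            # to the original pre-trained weights.
            params = dict(ema)
    return params, ema


if __name__ == "__main__":
    final, ema = adapt_stream({"w": 1.0}, [{"w": 1.0}] * 3, decay=0.5)
    print(final, ema)
```

Reverting to the moving average rather than to the pre-trained weights keeps what has been learned about the individual while damping the drift caused by noisy self-supervised updates, which is the trade-off the abstract points to.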

Paper Structure

This paper contains 47 sections, 6 equations, 9 figures, 13 tables, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of personal shape and habitual motion traits across observations (upper) and error accumulation in existing works as adaptation progresses (lower).
  • Figure 2: Framework Overview. During test time, the pose estimator $\mathrm{F}$ and motion denoising network $\mathrm{M}$ are alternately updated in a cyclic way. We employ motion discretization to regularize the adaptation of $\mathrm{F}$ and enable self-replay for adapting $\mathrm{M}$.
  • Figure 3: Our qualitative results. We show both 2D projections and 3D estimations in camera space.
  • Figure 4: Error difference versus adaptation progress, with progress expressed relative to the total recording length for each participant. We plot the error difference relative to the baseline without continuous adaptation (green), which always starts from pre-trained weights for each $\mathcal{V}$. Our complete solution (blue) is compared against the variant without motion discretization (red), which removes both $L_{ach}$ and self-replay. A lower $y$-axis value indicates better performance.
  • Figure 5: Visualization of the retrieved anchor $\boldsymbol{\theta}^\ast$ for the input video depicting basketball shooting, after adapting over 40 minutes of observation. Our self-replay mechanism facilitates decoding of realistic and regular anchor motions (e.g., leg pose) throughout the adaptation process.
  • ...and 4 more figures