Using Motion Cues to Supervise Single-Frame Body Pose and Shape Estimation in Low Data Regimes

Andrey Davydov, Alexey Sidnev, Artsiom Sanakoyeu, Yuhua Chen, Mathieu Salzmann, Pascal Fua

TL;DR

This work tackles data scarcity in monocular 3D human pose and shape estimation by leveraging motion signals from unannotated videos. It introduces a motion-consistency loss that aligns the optical flow $\mathbf{F}_{\textit{OF}}$ with the flow $\mathbf{F}_{\bm{B}}$ inferred from SMPL mesh changes, applied as weak supervision to refine a single-frame baseline network while keeping test-time inference strictly single-frame. The approach blends forward-backward flow alignment, an anchoring strategy to avoid degenerate solutions, and optional temporal context to bridge toward video-based methods, achieving notable improvements in P-MPJPE and motion smoothness across backbones and data regimes. It also demonstrates the complementary value of combining optical flow with texture cues or 2D keypoints, and shows that more unlabeled data further enhances performance, all in a data-efficient, privacy-conscious framework. Overall, the method provides a practical path to data-efficient monocular pose estimation that can leverage the abundant unlabeled videos available in the wild, with potential extensions to other moving-object domains such as animals.

Abstract

When enough annotated training data is available, supervised deep-learning algorithms excel at estimating human body pose and shape using a single camera. The effects of too little such data being available can be mitigated by using other information sources, such as databases of body shapes, to learn priors. Unfortunately, such sources are not always available either. We show that, in such cases, easy-to-obtain unannotated videos can be used instead to provide the required supervisory signals. Given a trained model using too little annotated data, we compute poses in consecutive frames along with the optical flow between them. We then enforce consistency between the image optical flow and the one that can be inferred from the change in pose from one frame to the next. This provides enough additional supervision to effectively refine the network weights and to perform on par with methods trained using far more annotated data.
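
The consistency term described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `sample_flow` and `motion_consistency_loss`, the nearest-neighbour flow lookup, and the plain mean L2 penalty are all assumptions made for illustration, and the vertices are assumed to be already projected to 2D image coordinates. A real training loop would need a differentiable framework and would likely handle occlusions and visibility, which this sketch ignores.

```python
import numpy as np

def sample_flow(flow, points):
    """Nearest-neighbour lookup of a dense flow field at 2D points.

    flow:   (H, W, 2) array of per-pixel (dx, dy) displacements.
    points: (N, 2) array of (x, y) image coordinates.
    """
    h, w = flow.shape[:2]
    xs = np.clip(np.rint(points[:, 0]).astype(int), 0, w - 1)
    ys = np.clip(np.rint(points[:, 1]).astype(int), 0, h - 1)
    return flow[ys, xs]

def motion_consistency_loss(verts_t, verts_t1, flow):
    """Mean L2 gap between mesh-induced flow and image optical flow.

    verts_t, verts_t1: (N, 2) projected mesh vertices in frames t and t+1
                       (hypothetical inputs, assumed to be in pixel units).
    flow:              (H, W, 2) optical flow from frame t to frame t+1.
    """
    mesh_flow = verts_t1 - verts_t           # flow implied by the change in pose
    image_flow = sample_flow(flow, verts_t)  # optical flow at the same locations
    return float(np.mean(np.linalg.norm(mesh_flow - image_flow, axis=1)))

# Toy example: every pixel moves one pixel to the right.
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0
verts_t = np.array([[0.0, 0.0], [1.0, 2.0]])

consistent = motion_consistency_loss(verts_t, verts_t + np.array([1.0, 0.0]), flow)
static = motion_consistency_loss(verts_t, verts_t, flow)
print(consistent, static)
```

In the toy example, a prediction that moves with the flow incurs no penalty, while a prediction that stays still is penalized by the full flow magnitude.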

Paper Structure (37 sections, 8 equations, 10 figures, 10 tables)

Figures (10)

  • Figure 1: Method Overview. Self-supervised training with optical flow guidance (Top). Given a pre-trained network that takes single frames as input, we compute SMPL body mesh estimates $\bm{B}$ in consecutive frames along with the optical flow $\mathbf{F}_{\textit{OF}}$ between them. This lets us write a loss term based on the consistency of $\mathbf{F}_{\textit{OF}}$ and of the flow $\mathbf{F}_{\bm{B}}$ that can be inferred from the changes in $\bm{B}$ across the frames. We then minimize a weighted sum of this loss and the one used to pre-train the network (Bottom left). At inference time (Bottom right), the refined body estimator can run on single images and does not require the optical flow anymore.
  • Figure 2: Point-wise nature of the $\textit{OF}$ motion-alignment loss. Points $p_1$ and $p_2$ are the predictions for the corresponding frames. If ground-truth labels are given (left), then supervision is straightforward. In the absence of ground truth (right), we use $p_1$ displaced by the $\textit{OF}$ estimate as a weak label for $p_2$.
  • Figure 3: Our method vs TexturePose. We plot P-MPJPE as a function of the percentage of annotations used, for the baseline model before refinement and after semi-supervised refinement using either TexturePose, our approach with OF only, or both OF and TexturePose. Our approach outperforms TexturePose, and, as expected, the less annotated data we use, the greater the improvement. Human3.6M was used as the video dataset. All metrics are P-MPJPE on the 3DPW test set.
  • Figure 4: More unlabelled videos. We plot P-MPJPE as a function of the amount of unlabelled data used for fine-tuning BL using our approach. As expected, the more unannotated data we use, the greater the improvement. All metrics are P-MPJPE on the 3DPW test set.
  • Figure 5: Different backbones. We apply our fine-tuning strategy using two different backbones, ResNet50 (solid) and HRNet-W32 (dashed). Our approach delivers a significant improvement in both cases. Note that the complex volumetric features of HRNet-W32 perform better than those of ResNet when we use a lot of training data and worse when we do not.
  • ...and 5 more figures
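
Read together, the captions of Figures 1 and 2 suggest an objective of roughly the following form. This is a hedged reading of the captions, not the paper's own equations: the choice of norm, the weight $\lambda$, and the symbols $\hat{p}_2$, $\mathcal{L}_{\textit{OF}}$, and $\mathcal{L}_{\textit{sup}}$ are assumptions introduced here for illustration.

$$
\hat{p}_2 = p_1 + \mathbf{F}_{\textit{OF}}(p_1), \qquad
\mathcal{L}_{\textit{OF}} = \lVert p_2 - \hat{p}_2 \rVert, \qquad
\mathcal{L} = \mathcal{L}_{\textit{sup}} + \lambda \, \mathcal{L}_{\textit{OF}}
$$

Here $\hat{p}_2$ is the weak label of Figure 2, obtained by displacing the prediction $p_1$ by the optical flow, and $\mathcal{L}_{\textit{sup}}$ stands for the loss used to pre-train the network. Since $p_2 - p_1$ is the displacement implied by the predictions, $\mathcal{L}_{\textit{OF}}$ equals $\lVert \mathbf{F}_{\bm{B}}(p_1) - \mathbf{F}_{\textit{OF}}(p_1) \rVert$, which is the flow-consistency term described in Figure 1.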