Two Views Are Better than One: Monocular 3D Pose Estimation with Multiview Consistency

Christian Keilstrup Ingwersen, Rasmus Tirsgaard, Rasmus Nylander, Janus Nørtoft Jensen, Anders Bjorholm Dahl, Morten Rieger Hannemose

TL;DR

This work tackles the challenge of monocular 3D human pose estimation by leveraging multiview data during training to impose a consistency constraint across two synchronized views, while performing inference from a single image. The core idea is a multiview consistency loss that aligns predicted pose sequences from different views using Procrustes analysis, avoiding any need for camera intrinsics or extrinsics. The approach enables effective fine-tuning with only 2D data or with limited 3D data, and achieves state-of-the-art semi-supervised performance on Human3.6M, as well as robust results on SportsPose and SkiPose. Practically, the method facilitates domain adaptation and data collection with off-the-shelf, uncalibrated cameras, broadening the applicability of monocular 3D pose estimation in real-world settings.

Abstract

Deducing a 3D human pose from a single 2D image is inherently challenging because multiple 3D poses can correspond to the same 2D representation. 3D data can resolve this pose ambiguity, but it is expensive to record and requires an intricate setup that is often restricted to controlled lab environments. We propose a method that improves the performance of deep learning-based monocular 3D human pose estimation models by using multiview data only during training, but not during inference. We introduce a novel loss function, consistency loss, which operates on two synchronized views. This approach is simpler than previous models that require 3D ground truth or intrinsic and extrinsic camera parameters. Our consistency loss penalizes differences in two pose sequences after rigid alignment. We also demonstrate that our consistency loss substantially improves performance for fine-tuning without requiring 3D data. Furthermore, we show that using our consistency loss can yield state-of-the-art performance when training models from scratch in a semi-supervised manner. Our findings provide a simple way to capture new data, e.g., in a new domain. This data can be added using off-the-shelf cameras with no calibration requirements. We make all our code and data publicly available.
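
To make the idea of "penalizing differences in two pose sequences after alignment" concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each view yields a predicted sequence of shape (T, J, 3), that a single similarity transform (rotation, translation, and optionally scale) is fitted over the whole sequence, and that the loss is the mean joint distance after alignment. The function names `procrustes_align` and `consistency_loss` and the `allow_scale` flag are hypothetical. A training setup would need a differentiable version in a deep learning framework.

```python
import numpy as np


def procrustes_align(src, dst, allow_scale=True):
    """Align points src (N, 3) to dst (N, 3) with one similarity transform."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Optimal rotation from the SVD of the 3x3 cross-covariance matrix.
    U, S, Vt = np.linalg.svd(src_c.T @ dst_c)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T

    # Least-squares scale; fixed to 1 for a purely rigid alignment.
    scale = (S * np.diag(D)).sum() / (src_c ** 2).sum() if allow_scale else 1.0
    return scale * src_c @ R.T + mu_dst


def consistency_loss(seq_a, seq_b, allow_scale=True):
    """Mean 3D joint distance between two (T, J, 3) sequences after alignment."""
    pts_a = seq_a.reshape(-1, 3)  # stack all frames so one transform is fitted
    pts_b = seq_b.reshape(-1, 3)
    aligned_a = procrustes_align(pts_a, pts_b, allow_scale=allow_scale)
    return np.linalg.norm(aligned_a - pts_b, axis=1).mean()
```

Because the alignment is estimated from the predictions themselves, nothing in this sketch needs camera intrinsics or extrinsics. Two predictions of the same motion that differ only by a rotation, translation, and scale give a loss near zero.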

Paper Structure (16 sections, 8 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: We improve monocular performance by applying our consistency loss during training to predicted 3D pose sequences from two different views. The consistency loss penalizes variations between the two predicted pose sequences of the same activity. Note that we only use multiple views during training. For every predicted 3D pose sequence obtained from View A and View B, we compute a similarity transform with Procrustes Analysis. This transformation aligns the predicted poses in Sequence A with Sequence B. The consistency loss is the average 3D distance between the two pose sequences post-alignment, shown as dashed red lines. Using Procrustes analysis for this transformation enables us to use cameras with unknown intrinsics and extrinsics.
  • Figure 2: The five activities from SportsPose (Ingwersen et al., 2023). The top row displays the publicly available view "right". The bottom row features a view rotated 90 degrees relative to "right", which we refer to as "View 1".
  • Figure 3: Visual comparison of predictions in green and the ground truth pose in blue. The magnitude of errors, measured in millimeters and indicated at the top, highlights the superiority of our consistency loss $\mathcal{L}_{\text{2D}_\text{con}}$ in achieving more accurate results. The notable improvement is especially evident in the bottom row, where the method employing our consistency loss successfully captures the complex movement.
  • Figure 4: Investigation of how MPJPE and PA-MPJPE are affected when varying the number of views available with different losses. Top: with consistency loss ($\lambda_{\text{con}}=1$); bottom: without consistency loss. Left: 2D; right: 3D.
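
The captions above report errors as MPJPE and PA-MPJPE. For reference, the commonly used definitions of these two metrics can be written as follows. The notation is an assumption made here for illustration and may differ from the paper's own equations: $\hat{\mathbf{p}}_j$ is the predicted position of joint $j$, $\mathbf{p}_j$ is the ground truth, and $J$ is the number of joints.

```latex
\begin{align}
\text{MPJPE} &= \frac{1}{J} \sum_{j=1}^{J} \left\lVert \hat{\mathbf{p}}_j - \mathbf{p}_j \right\rVert_2 \\
(s^\ast, R^\ast, \mathbf{t}^\ast) &= \operatorname*{arg\,min}_{s,\, R,\, \mathbf{t}} \sum_{j=1}^{J} \left\lVert s R \hat{\mathbf{p}}_j + \mathbf{t} - \mathbf{p}_j \right\rVert_2^2 \\
\text{PA-MPJPE} &= \frac{1}{J} \sum_{j=1}^{J} \left\lVert s^\ast R^\ast \hat{\mathbf{p}}_j + \mathbf{t}^\ast - \mathbf{p}_j \right\rVert_2
\end{align}
```

Here $s$ is a scale, $R$ a rotation, and $\mathbf{t}$ a translation, so PA-MPJPE measures the error that remains after a Procrustes (similarity) alignment of the prediction to the ground truth. It is therefore never larger than MPJPE by much in practice and is insensitive to global orientation and scale errors.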