EPOCH: Jointly Estimating the 3D Pose of Cameras and Humans

Nicola Garau, Giulia Martinelli, Niccolò Bisagno, Denis Tomè, Carsten Stoll

TL;DR

Monocular 3D human pose estimation (HPE) is inherently ill-posed because of depth ambiguity. The paper introduces EPOCH, a camera-in-the-loop framework with two components: LiftNet, which lifts 2D poses to 3D without supervision using a full perspective camera model, and RegNet, which estimates the 2D pose and the full camera parameters from a single image under weak supervision. Normalizing Flows enforce pose plausibility, and cycle-consistency losses provide self-supervision. The method achieves state-of-the-art results on Human3.6M and MPI-INF-3DHP and generalizes to unseen in-the-wild data such as 3DPW, without requiring camera ground truth. By modeling the camera explicitly, the work advances 3D HPE that needs no 3D supervision and remains camera-aware across diverse environments.

Abstract

Monocular Human Pose Estimation (HPE) aims at determining the 3D positions of human joints from a single 2D image captured by a camera. However, a single 2D point in the image may correspond to multiple points in 3D space. Typically, the uniqueness of the 2D-3D relationship is approximated using an orthographic or weak-perspective camera model. In this study, instead of relying on approximations, we advocate for utilizing the full perspective camera model. This involves estimating camera parameters and establishing a precise, unambiguous 2D-3D relationship. To do so, we introduce the EPOCH framework, comprising two main components: the pose lifter network (LiftNet) and the pose regressor network (RegNet). LiftNet utilizes the full perspective camera model to precisely estimate the 3D pose in an unsupervised manner. It takes a 2D pose and camera parameters as inputs and produces the corresponding 3D pose estimate. These inputs are obtained from RegNet, which starts from a single image and provides estimates for the 2D pose and camera parameters. RegNet utilizes only 2D pose data as weak supervision. Internally, RegNet predicts a 3D pose, which is then projected to 2D using the estimated camera parameters. This process enables RegNet to establish the unambiguous 2D-3D relationship. Our experiments show that modeling the lifting as an unsupervised task with a camera-in-the-loop results in better generalization to unseen data. We obtain state-of-the-art results for 3D HPE on the Human3.6M and MPI-INF-3DHP datasets. Our code is available at: [GitHub link upon acceptance, see supplementary materials].
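
The difference between the full perspective model and the weak-perspective approximation mentioned in the abstract can be illustrated with a minimal sketch. This is not the EPOCH implementation: the function names, the focal length `f`, the principal point `(cx, cy)`, and the toy joints are assumptions chosen for illustration, and the joints are assumed to be already expressed in camera coordinates (extrinsics applied).

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project_perspective(
    points: Sequence[Point3D], f: float, cx: float, cy: float
) -> List[Point2D]:
    """Full perspective (pinhole) projection: each joint is divided by its own depth Z."""
    return [(f * x / z + cx, f * y / z + cy) for x, y, z in points]


def project_weak_perspective(
    points: Sequence[Point3D], f: float, cx: float, cy: float
) -> List[Point2D]:
    """Weak perspective: all joints share a single depth, here the mean Z of the pose."""
    z_mean = sum(z for _, _, z in points) / len(points)
    return [(f * x / z_mean + cx, f * y / z_mean + cy) for x, y, _ in points]


# Two hypothetical joints with the same X and Y but different depths (metres).
joints = [(0.5, 0.0, 2.0), (0.5, 0.0, 4.0)]

full = project_perspective(joints, f=1000.0, cx=0.0, cy=0.0)
weak = project_weak_perspective(joints, f=1000.0, cx=0.0, cy=0.0)

print("perspective:     ", full)  # the two joints land on different pixels
print("weak perspective:", weak)  # the two joints collapse onto the same pixel
```

In this toy case the weak-perspective model maps both joints to the same image location, so their relative depth cannot be recovered from the projection, whereas the full perspective model keeps them apart.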

Paper Structure

This paper contains 17 sections, 19 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: (a) In human pose estimation, classical approaches regress the 2D/3D joint locations directly from an image. If ground truth is available, the camera parameters can be used or learned to refine the accuracy. (b) Lifting approaches aim at retrieving the depth of each 2D joint to obtain the 3D pose. (c) We propose a novel paradigm that directly estimates the 3D pose and the camera from images. The 2D pose can then be computed by projecting the 3D coordinates into image space using the camera parameters. (d) Starting from the estimated 2D poses and camera parameters, we perform the lifting to 3D, improving performance over current approaches.
  • Figure 2: In (I), we define two vectors, denoted as $A$ and $B$, connecting the spine and the hip joints. The cross product of these vectors yields the normal vector $N$, which aligns with the walking direction. In (II) and (III), we show the outcome of the dot product between $N$ and the proximal $p_l$ and distal $d_l$ components, resulting in their projections $P_l$ and $D_l$. In (II), $\mathcal{L}_{limbs}$ gives an output of 0, indicating an anthropomorphically compliant prediction. In (III), $\mathcal{L}_{limbs}$ returns a positive value, signaling the need for further correction.
  • Figure 3: LiftNet architecture. The red ($2D \rightarrow 3D$), orange ($\circlearrowleft$ and $\circlearrowright$) and yellow ($3D \rightarrow 2D$) blocks describe the Lift, Rotate, Project operations respectively. The symbol $x$ denotes a 2D pose, $y$ denotes a 3D pose. The decorator $\,\hat{}\,$ symbolizes a prediction in the forward pass while $\,\widetilde{}\,$ marks a prediction in the backward pass. The subscript $_r$ stands for rotated. The solid arrows describe the flow of the network, while the dashed arrows connect each intermediate datum to its loss.
  • Figure 4: RegNet architecture. The $W \times H$ input image is fed to (a) a contrastive-pretrained encoder and a separate module $\Psi$ that estimates the intrinsic parameters. The output features are then concatenated and (b) fed into our attention-based capsule decoder. The outputs are three separate capsule vectors, representing an estimation of the 3D pose $\hat{y}$, of the camera $[K] [R|t]$ and a joint presence vector $\Sigma$. (c) Each of the outputs needs to be further processed before the loss computation. A copy of $\hat{y}$ is randomly rotated around the vertical axis, obtaining $\hat{y}_r$. $\hat{y}$ and $\hat{y}_r$ are projected into the camera plane and $\Sigma$ goes through a sigmoid activation function. (d) $\hat{y}$, $\hat{x}_r$, $\hat{x}$ and $\hat{\sigma}$ are fed to the loss functions.
  • Figure 5: EPOCH qualitative results on MPI-INF-3DHP [mono-3dhp2017] (columns 1-4) and 3DPW [von2018recovering] (columns 5-6). Rows: input images, RegNet output, LiftNet output (front and side view). Our method can generalize to unseen in-the-wild data (3DPW) even when trained only on Human3.6M data.
  • ...and 2 more figures
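
The geometric quantities named in the Figure 2 caption (the normal $N$ from the cross product of $A$ and $B$, and the projections of the proximal and distal limb components onto $N$) can be sketched as follows. This is a hedged illustration only: the function names, the axis convention (Y up, Z forward), the order of the cross product, and the toy vectors are all assumptions. The caption does not give the closed form of $\mathcal{L}_{limbs}$, so the loss itself is not reproduced here.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def cross(a: Vec3, b: Vec3) -> Vec3:
    """Cross product a x b."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dot(a: Vec3, b: Vec3) -> float:
    """Dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def body_normal(a: Vec3, b: Vec3) -> Vec3:
    """Unit normal N = (A x B) / |A x B| of the plane spanned by A and B."""
    n = cross(a, b)
    length = math.sqrt(dot(n, n))
    if length == 0.0:
        raise ValueError("A and B are parallel; the normal is undefined")
    return (n[0] / length, n[1] / length, n[2] / length)


def limb_projections(n: Vec3, p_l: Vec3, d_l: Vec3) -> Tuple[float, float]:
    """Signed projections (P_l, D_l) of the proximal and distal components onto N."""
    return dot(n, p_l), dot(n, d_l)


# Hypothetical toy skeleton (assumed convention: Y up, Z forward).
A = (0.0, 1.0, 0.0)      # assumed spine vector
B = (-1.0, 0.0, 0.0)     # assumed hip vector
N = body_normal(A, B)

p_l = (0.0, -0.4, 0.1)   # assumed proximal component, tilted forward
d_l = (0.0, -0.4, -0.1)  # assumed distal component, tilted backward

P_l, D_l = limb_projections(N, p_l, d_l)
print("N =", N, "P_l =", P_l, "D_l =", D_l)
```

The signs of `P_l` and `D_l` indicate whether each limb component points along or against the assumed forward direction; a limb loss of the kind described in the caption would be built on top of these signed projections.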