
Self-learning Canonical Space for Multi-view 3D Human Pose Estimation

Xiaoben Li, Mancheng Meng, Ziyan Wu, Terrence Chen, Fan Yang, Dinggang Shen

TL;DR

Multi-view 3D human pose estimation benefits from additional geometric cues but relies on scarce annotations. The paper introduces CMANet, a fully self-supervised cascaded framework that builds a canonical parameter space defined by per-view camera pose $π^i=(R^i,t^i)$, global orientation $θ_g^i$, and shared SMPL pose $θ_b$ and shape $β$, integrated via an intra-view module (IRV) and an inter-view module (IEV). A two-stage learning procedure first trains IRV to estimate intra-view quantities using 2D keypoint reprojection and SMPL-based losses, then freezes IRV and trains IEV to fuse multi-view information and refine poses through cross-view geometry constraints without ground-truth labels. Experiments on Human3.6M, MPI-INF-3DHP, and TotalCapture show CMANet achieving state-of-the-art performance among self-supervised methods and competitive results versus supervised mesh-based approaches, validating effective canonical fusion of heterogeneous multi-view information and robust pose/mask estimation under occlusion.

Abstract

Multi-view 3D human pose estimation is naturally superior to its single-view counterpart, benefiting from the more comprehensive information provided by images from multiple views. This information includes camera poses, 2D/3D human poses, and 3D geometry. However, accurate annotations of this information are hard to obtain, making it challenging to predict accurate 3D human pose from multi-view images. To deal with this issue, we propose a fully self-supervised framework, named cascaded multi-view aggregating network (CMANet), which constructs a canonical parameter space to holistically integrate and exploit multi-view information. In our framework, the multi-view information is grouped into two categories: 1) intra-view information and 2) inter-view information. Accordingly, CMANet consists of two components: an intra-view module (IRV) and an inter-view module (IEV). IRV extracts the initial camera pose and 3D human pose of each view; IEV fuses complementary pose information and cross-view 3D geometry to produce the final 3D human pose. To facilitate the aggregation of intra- and inter-view information, we define a canonical parameter space, described by the per-view camera pose and the human pose and shape parameters ($θ$ and $β$) of the SMPL model, and propose a two-stage learning procedure. In the first stage, IRV learns to estimate camera pose and view-dependent 3D human pose, supervised by the confident outputs of an off-the-shelf 2D keypoint detector. In the second stage, IRV is frozen and IEV further refines the camera pose and optimizes the 3D human pose by implicitly encoding the cross-view complement and 3D geometry constraint, achieved by jointly fitting the predicted multi-view 2D keypoints. Comprehensive experiments demonstrate the effectiveness of the proposed framework, modules, and learning strategy, and CMANet outperforms state-of-the-art methods in extensive quantitative and qualitative analyses.
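The supervision signal described in the abstract, fitting predicted 3D joints to confident 2D detections across views, can be illustrated with a small NumPy sketch. This is not the authors' implementation. The pinhole camera with a shared intrinsics matrix `K`, the hard confidence threshold `conf_thresh`, the squared-error form, and the function names `project` and `reprojection_loss` are all assumptions made for illustration.

```python
import numpy as np

def project(points_3d, R, t, K):
    """Project (J, 3) world-space joints into one view (assumed pinhole camera)."""
    cam = points_3d @ R.T + t          # world -> camera coordinates, (J, 3)
    uvw = cam @ K.T                    # apply intrinsics, (J, 3)
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide, (J, 2)

def reprojection_loss(joints_3d, cameras, keypoints_2d, confidences, K,
                      conf_thresh=0.5):
    """Mean squared 2D error over all views, using only confident detections.

    joints_3d    : (J, 3) joints shared by all views
    cameras      : list of (R, t) per view, R is (3, 3), t is (3,)
    keypoints_2d : list of (J, 2) detected keypoints per view
    confidences  : list of (J,) detector confidences per view
    conf_thresh  : hypothetical cutoff below which a detection is ignored
    """
    total, count = 0.0, 0
    for (R, t), kp, conf in zip(cameras, keypoints_2d, confidences):
        mask = conf >= conf_thresh     # keep confident detections only
        if not mask.any():
            continue
        pred = project(joints_3d, R, t, K)
        err = np.sum((pred[mask] - kp[mask]) ** 2, axis=1)
        total += float(err.sum())
        count += int(mask.sum())
    return total / max(count, 1)
```

Because every view projects the same `joints_3d`, minimizing this quantity over all views at once ties the per-view camera poses to a single 3D pose, which is one way to read the "jointly fitting" step of the second stage.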

Paper Structure (14 sections, 6 equations, 3 figures, 5 tables)


Figures (3)

  • Figure 1: Illustration of the two categories of multi-view information (intra-view and inter-view) and the basic idea of the proposed cascaded multi-view aggregation pipeline and canonical parameter space. For simplicity, two views are shown here; the pipeline also applies to more views.
  • Figure 2: The architecture of the proposed cascaded multi-view aggregating network (CMANet), which consists of two components: the intra-view module (IRV) and the inter-view module (IEV). IRV estimates the intra-view information of each view, i.e., the camera pose and the 3D human pose and shape. After screening, IEV uses the camera poses, the global orientations, and the optimal human body pose and shape to initialize the proposed canonical parameter space, then aggregates the features from all views to refine the camera poses and the human model and output the final 3D human keypoints. A two-stage learning procedure is adopted. In the first stage, IRV learns to extract the intra-view information of each view, supervised by a reprojection loss provided by a 2D keypoint detector and an SMPL loss provided by SMPLify. In the second stage, IRV is frozen and IEV learns to fuse multi-view information, supervised by the reprojection loss and SMPL loss from all views.
  • Figure 3: Visualization of reconstructed human mesh and 3D pose of samples from Human3.6M (row 1), MPI-INF-3DHP (row 2) and TotalCapture (row 3).
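The canonical parameter space initialization described in the Figure 2 caption can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not the paper's code. The class names `ViewEstimate` and `CanonicalSpace`, the function `init_canonical_space`, and the screening rule (take the body pose and shape from the view with the lowest reprojection error) are hypothetical. The array sizes follow the standard SMPL parameterization: a 3-value axis-angle global orientation, 69 body pose values, and 10 shape coefficients.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ViewEstimate:
    """Per-view output of an intra-view estimator (hypothetical container)."""
    R: np.ndarray              # (3, 3) camera rotation
    t: np.ndarray              # (3,) camera translation
    global_orient: np.ndarray  # (3,) axis-angle global orientation
    body_pose: np.ndarray      # (69,) SMPL body pose, 23 joints x 3
    betas: np.ndarray          # (10,) SMPL shape coefficients
    reproj_error: float        # 2D fitting error, used here for screening

@dataclass
class CanonicalSpace:
    """Per-view cameras and orientations plus one shared body pose and shape."""
    cameras: list              # [(R, t), ...], one entry per view
    global_orients: list       # [(3,), ...], one entry per view
    body_pose: np.ndarray      # (69,) shared across views
    betas: np.ndarray          # (10,) shared across views

def init_canonical_space(views):
    """Keep view-specific quantities per view; share pose and shape.

    Assumed screening rule: the view with the lowest reprojection error
    supplies the shared body pose and shape.
    """
    best = min(views, key=lambda v: v.reproj_error)
    return CanonicalSpace(
        cameras=[(v.R, v.t) for v in views],
        global_orients=[v.global_orient for v in views],
        body_pose=best.body_pose.copy(),
        betas=best.betas.copy(),
    )
```

Separating the view-dependent quantities (camera pose, global orientation) from the view-independent ones (body pose, shape) is what lets estimates from different views be compared and fused in one space.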