Human Mesh Recovery from Arbitrary Multi-view Images

Xiaoben Li, Mancheng Meng, Ziyan Wu, Terrence Chen, Fan Yang, Dinggang Shen

TL;DR

This work tackles human mesh recovery from arbitrary multi-view images by decoupling camera pose estimation from body mesh recovery, enabling a concise and flexible architecture. A camera and body decoupling (CBD) structure splits the task into per-view camera pose estimation via a shared MLP (CPE) and cross-view mesh fusion via a transformer decoder with a SMPL query token (AVF), which aggregates information from any number of views. The approach uses SMPL to represent body pose and shape, and combines 2D reprojection, 3D keypoint, and SMPL parameter losses with adversarial priors to supervise training. Experiments on Human3.6M, MPI-INF-3DHP, and TotalCapture demonstrate strong performance and robust fusion across varying numbers of views, with the ViT backbone offering notable gains. The method provides a practical, calibration-free, and view-agnostic solution for 3D human reconstruction in diverse multi-view settings, improving on prior work in both accuracy and flexibility.

Abstract

Human mesh recovery from arbitrary multi-view images involves two characteristics: arbitrary camera poses and an arbitrary number of camera views. Because of this variability, designing a unified framework to tackle the task is challenging. The challenge can be summarized as the dilemma of simultaneously estimating arbitrary camera poses and recovering the human mesh from arbitrary multi-view images while maintaining flexibility. To solve this dilemma, we propose a divide-and-conquer framework for Unified Human Mesh Recovery (U-HMR) from arbitrary multi-view images. In particular, U-HMR consists of a decoupled structure, camera and body decoupling (CBD), and two main components: camera pose estimation (CPE) and arbitrary view fusion (AVF). As camera poses and the human body mesh are independent of each other, CBD splits their estimation into two sub-tasks handled by two individual sub-networks (i.e., CPE and AVF), so that the two sub-tasks are disentangled. In CPE, since each camera pose is unrelated to the others, we adopt a shared MLP to process all views in parallel. In AVF, in order to fuse multi-view information and make the fusion operation independent of the number of views, we introduce a transformer decoder with a SMPL parameters query token to extract cross-view features for mesh recovery. To demonstrate the efficacy and flexibility of the proposed framework and the effect of each component, we conduct extensive experiments on three public datasets: Human3.6M, MPI-INF-3DHP, and TotalCapture.
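To make the decoupling concrete, the following is a minimal sketch of the idea, not the authors' implementation. It assumes each view has already been encoded into T feature tokens of dimension D. The sizes D, T, CAM_DIM, and BODY_DIM, the mean-pooling before the camera head, and all function names are illustrative assumptions. A real transformer decoder would also include learned key/value projections, multiple heads, feed-forward layers, and position embeddings, all omitted here. The point is only that a shared per-view head yields one camera estimate per view, while a single query attending over the concatenated tokens yields a fixed-size body estimate for any number of views.

```python
import math
import random

# Hypothetical sizes, chosen only for illustration.
D, T, CAM_DIM, BODY_DIM = 8, 4, 3, 16


def make_linear(rng, n_in, n_out):
    """Random weights for a linear layer: (W as list of rows, bias)."""
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def linear(x, W, b):
    """y = W x + b for a single vector x."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def shared_mlp(view_feats, l1, l2):
    """CPE-style head: the same two-layer MLP is applied to every view."""
    out = []
    for f in view_feats:
        hidden = [max(0.0, v) for v in linear(f, *l1)]  # ReLU
        out.append(linear(hidden, *l2))
    return out


def query_attention(query, tokens):
    """Single-query cross-attention: softmax(q.k / sqrt(d)) weighted sum."""
    d = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d)
              for tok in tokens]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum((e / z) * tok[j] for e, tok in zip(exps, tokens))
            for j in range(d)]


def init_params(rng):
    return {
        "cpe1": make_linear(rng, D, D),
        "cpe2": make_linear(rng, D, CAM_DIM),
        "query": [rng.gauss(0.0, 1.0) for _ in range(D)],  # learnable in practice
        "head": make_linear(rng, D, BODY_DIM),
    }


def forward(views, params):
    """views: list of N views, each a list of T tokens of dimension D."""
    # Camera branch: pool each view's tokens, then apply the shared MLP.
    pooled = [[sum(tok[j] for tok in v) / len(v) for j in range(D)]
              for v in views]
    cams = shared_mlp(pooled, params["cpe1"], params["cpe2"])
    # Body branch: one query attends over the tokens of all views at once.
    all_tokens = [tok for v in views for tok in v]
    fused = query_attention(params["query"], all_tokens)
    body = linear(fused, *params["head"])
    return cams, body


def make_views(rng, n):
    """Random stand-in for encoder output: n views of T tokens each."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]
            for _ in range(n)]


params = init_params(random.Random(0))

if __name__ == "__main__":
    rng = random.Random(1)
    for n in (1, 2, 4):
        cams, body = forward(make_views(rng, n), params)
        print(n, "views:", len(cams), "camera estimates,", len(body), "body values")
```

Because the softmax normalizes over however many tokens are present, the same parameters handle 1, 2, or 4 views without any change, which is the property the abstract attributes to AVF.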

Paper Structure (16 sections, 6 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Unified Human Mesh Recovery (U-HMR): Recovering human pose and shape from arbitrary multi-view images. We propose a concise, flexible, and effective framework for human mesh recovery from arbitrary multi-view images. From left to right, the results of human mesh recovery from 1-view, 2-view, and 4-view images (from three different datasets) are shown. Our framework can be directly applied to an arbitrary number of views without any modification, fine-tuning, or re-training, and can learn multi-view information effectively for human mesh recovery. Due to limited page space, only results for up to 4 views are displayed here. Results on more views are in the supplementary material.
  • Figure 2: Comparison of the multi-view scenario with a coupled structure (left) and the arbitrary multi-view scenario with a decoupled structure (right). (a): The conventional model structure for human mesh recovery from multi-view images. (b): The model structure for estimating camera pose $\pi_i$ for an arbitrary number of views; note that the regressor here is shared across different views. (c): The model structure of arbitrary multi-view feature fusion for body pose $\theta_b$ and shape $\beta$ estimation.
  • Figure 3: Overview of the proposed framework. We divide the task of reconstructing a 3D human mesh from arbitrary multi-view images into two sub-tasks: 1) the estimation of camera parameters, 2) the estimation of body mesh (pose & shape) parameters. This is achieved by a camera and body decoupling structure (CBD). Given $N$ images of a human from different camera views, we first extract 2D features from each image using a 2D image encoder. Then the 2D feature maps are forwarded to two modules: the camera pose estimation module (CPE) and the arbitrary view fusion module (AVF). In CPE, the feature maps of each view are fed into an MLP, which is shared across views, to predict the camera parameters $\pi_i$ of each view independently. In AVF, feature maps from all views are reshaped into tokens and forwarded into a transformer decoder. Inspired by PETR [liu2022petr], multi-view position embeddings are adopted to distinguish the tokens of different regions and views. A single learnable SMPL query token attends to the tokens from the multi-view images through cross-attention, so that the multi-view information is implicitly encoded into the SMPL query token, which is subsequently used to produce the final body pose parameters $\theta_b$ and shape parameters $\beta$.
  • Figure 4: Different architecture designs for human mesh recovery from multi-view images. The 2D feature maps are produced by the same 2D image encoder. All these variants output human mesh parameters and camera parameters of all views.
  • Figure 5: The performance trend with respect to the number of views.
  • ...and 4 more figures