Towards Metric-Aware Multi-Person Mesh Recovery by Jointly Optimizing Human Crowd in Camera Space

Kaiwen Wang, Kaili Zheng, Yiming Shi, Chenyi Guo, Ji Wu

TL;DR

This work tackles multi-person mesh recovery from a single image by addressing scene-level inconsistencies that arise from per-person pGT generation. It introduces Depth-conditioned Translation Optimization (DTO), a MAP-based framework that jointly refines camera-space translations using anthropometric height priors and monocular depth cues to achieve scene-consistent placements, and constructs the DTO-Humans pGT dataset. Building on this, Metric-Aware HMR (MA-HMR) extends an end-to-end network with a camera branch and a relative metric loss to achieve true metric-scale mesh recovery, yielding state-of-the-art results on relative depth reasoning and mesh accuracy across multiple benchmarks. The combination of DTO-Humans and MA-HMR advances scene-aware 3D human understanding with practical impact for robust multi-person reconstruction in the wild.

Abstract

Multi-person human mesh recovery from a single image is a challenging task, hindered by the scarcity of in-the-wild training data. Prevailing in-the-wild human mesh pseudo-ground-truth (pGT) generation pipelines are single-person-centric, where each human is processed individually without joint optimization. This oversight leads to a lack of scene-level consistency, producing individuals with conflicting depths and scales within the same image. To address this, we introduce Depth-conditioned Translation Optimization (DTO), a novel optimization-based method that jointly refines the camera-space translations of all individuals in a crowd. By leveraging anthropometric priors on human height and depth cues from a monocular depth estimator, DTO solves for a scene-consistent placement of all subjects within a principled Maximum a posteriori (MAP) framework. Applying DTO to the 4D-Humans dataset, we construct DTO-Humans, a new large-scale pGT dataset of 0.56M high-quality, scene-consistent multi-person images, featuring dense crowds with an average of 4.8 persons per image. Furthermore, we propose Metric-Aware HMR, an end-to-end network that directly estimates human mesh and camera parameters in metric scale. This is enabled by a camera branch and a relative metric loss that enforces plausible relative scales. Extensive experiments demonstrate that our method achieves state-of-the-art performance on relative depth reasoning and human mesh recovery. Code is available at: https://github.com/gouba2333/MA-HMR.
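To make concrete why a prior on human height helps resolve depth, the short sketch below uses the standard pinhole camera relations: focal length from field of view, and depth from real versus projected height. It is an illustration only, not the DTO formulation; the function names, the 60-degree field of view, the 1.70 m height, and the pixel heights are all assumed values chosen for the example.

```python
import math


def focal_from_fov(fov_deg: float, image_size_px: float) -> float:
    """Pinhole focal length in pixels from the field of view along one image axis."""
    return 0.5 * image_size_px / math.tan(math.radians(fov_deg) / 2.0)


def depth_from_height(focal_px: float, height_m: float, height_px: float) -> float:
    """Depth Z in metres from the projection relation height_px = focal_px * height_m / Z."""
    return focal_px * height_m / height_px


# Hypothetical scene: a 1000-pixel-tall image with a 60-degree vertical field of view.
focal = focal_from_fov(fov_deg=60.0, image_size_px=1000.0)

# Two people assumed to be 1.70 m tall, appearing 400 px and 200 px tall.
near = depth_from_height(focal, height_m=1.70, height_px=400.0)
far = depth_from_height(focal, height_m=1.70, height_px=200.0)

print(f"focal = {focal:.1f} px, near = {near:.2f} m, far = {far:.2f} m")
```

Under this simplified model, halving the projected height doubles the estimated depth, so an error in the assumed height or focal length shifts a person's placement in proportion. This is one reason per-person fitting without shared constraints can yield conflicting depths and scales within a single image.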

Paper Structure

This paper contains 50 sections, 18 equations, 18 figures, 9 tables.

Figures (18)

  • Figure 1: CamSMPLify [camerahmr] fits each person independently, causing height and spatial inconsistencies. Our DTO jointly optimizes human translations from height priors and depth cues, ensuring a coherent scene reconstruction.
  • Figure 2: Overview of the DTO Framework. An input image is processed through three parallel streams: an HMR Model provides initial meshes; Age & Gender Model informs a statistical height prior; and Depth Model generates a relative depth map. From the initial meshes and the depth map, we extract the inter-human depth relation and intra-human depth scale. DTO integrates these components into an MAP problem to solve for a global affine transformation, outputting a scene-consistent arrangement of all individuals.
  • Figure 3: Architecture of our Metric-Aware HMR. We enhance a SAT-HMR backbone with two key innovations for true metric-scale recovery: a camera branch that predicts the camera's Field of View from global features using a dedicated camera token, and a relative metric loss that directly supervises the real-world distances between predicted individuals.
  • Figure 4: Initial Per-Person Estimation Pipeline.
  • Figure 5: Height Priors for Minors.
  • ...and 13 more figures
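
The Figure 2 caption describes solving for a global affine transformation that reconciles depth-map cues with the initial meshes. As a rough illustration of that idea only, the sketch below fits a single scale and shift that maps per-person relative depths onto target depths by ordinary least squares. This is a simplified stand-in and not the MAP objective used by DTO; the function name `fit_affine` and all numeric values are hypothetical.

```python
def fit_affine(rel_depth, target_depth):
    """Closed-form least squares for (scale, shift) with scale * rel + shift ~= target."""
    if len(rel_depth) != len(target_depth) or len(rel_depth) < 2:
        raise ValueError("need two or more paired samples")
    n = len(rel_depth)
    mean_x = sum(rel_depth) / n
    mean_y = sum(target_depth) / n
    var_x = sum((x - mean_x) ** 2 for x in rel_depth)
    if var_x == 0.0:
        raise ValueError("relative depths must not all be equal")
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(rel_depth, target_depth))
    scale = cov_xy / var_x
    shift = mean_y - scale * mean_x
    return scale, shift


# Hypothetical per-person values: relative depths sampled from a depth map,
# and target depths in metres implied by some height-based estimate.
relative = [0.2, 0.5, 0.9]
target = [1.8, 3.0, 4.6]

scale, shift = fit_affine(relative, target)
aligned = [scale * r + shift for r in relative]
print(scale, shift, aligned)
```

Because one scale and one shift are shared by everyone in the image, the ordering of people along the depth axis given by the depth map is preserved whenever the fitted scale is positive.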