Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot

Fabien Baradel, Matthieu Armando, Salma Galaaoui, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez, Thomas Lucas

TL;DR

Multi-HMR presents a first single-shot approach for multi-person whole-body human mesh recovery from a single RGB image. It predicts SMPL-X body, hand, and facial-expression parameters together with each person's 3D location in camera space. A Vision Transformer backbone extracts image tokens, and a cross-attention-based Human Perception Head (HPH) regresses per-person SMPL-X parameters and depth, optionally incorporating camera intrinsics via Fourier-encoded ray directions. A dedicated synthetic dataset, CUFFS, improves hand pose learning, enabling detailed hand and face predictions without high-resolution crops. Modest backbones give real-time performance, and larger models reach state-of-the-art results. The method performs strongly on both body-only and whole-body benchmarks, scales well with the number of people, and is of practical use for AR/VR, robotics, and immersive perception tasks.

Abstract

We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, using the SMPL-X parametric model and 3D location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person locations, using features produced by a standard Vision Transformer (ViT) backbone. It then predicts their whole-body pose, shape and 3D location using a new cross-attention module called the Human Prediction Head (HPH), with one query attending to the entire set of features for each detected person. As direct prediction of fine-grained hands and facial poses in a single shot, i.e., without relying on explicit crops around body parts, is hard to learn from existing data, we introduce CUFFS, the Close-Up Frames of Full-Body Subjects dataset, containing humans close to the camera with diverse hand poses. We show that incorporating it into the training data further enhances predictions, particularly for hands. Multi-HMR also optionally accounts for camera intrinsics, if available, by encoding camera ray directions for each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously: a ViT-S backbone on $448 \times 448$ images already yields a fast and competitive model, while larger models and higher resolutions obtain state-of-the-art results.
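
To make the camera-conditioning idea concrete, here is a minimal sketch of how a ray direction per image token could be computed and Fourier-encoded. It is an illustration under stated assumptions, not the authors' implementation: the pinhole camera model, the patch size of 14, the focal length, the number of frequency bands, and the function names `patch_ray_directions` and `fourier_encode` are all hypothetical choices made for this example.

```python
import math

def patch_ray_directions(img_size, patch_size, focal, principal_point):
    """Unit ray direction through the centre of each patch (pinhole model, assumed)."""
    cx, cy = principal_point
    n = img_size // patch_size  # patches per side
    rays = []
    for row in range(n):
        for col in range(n):
            # Pixel coordinates of the patch centre.
            u = (col + 0.5) * patch_size
            v = (row + 0.5) * patch_size
            # Back-project through the intrinsics, then normalise to unit length.
            d = ((u - cx) / focal, (v - cy) / focal, 1.0)
            norm = math.sqrt(sum(c * c for c in d))
            rays.append(tuple(c / norm for c in d))
    return rays

def fourier_encode(vec, num_bands=4):
    """Sine/cosine features at doubling frequencies (band count is an assumption)."""
    feats = []
    for c in vec:
        for k in range(num_bands):
            freq = (2.0 ** k) * math.pi
            feats.append(math.sin(freq * c))
            feats.append(math.cos(freq * c))
    return feats

# Hypothetical intrinsics for a 448x448 image.
rays = patch_ray_directions(img_size=448, patch_size=14,
                            focal=500.0, principal_point=(224.0, 224.0))
ray_embeddings = [fourier_encode(r) for r in rays]
```

In a full model, each encoding would presumably pass through a learned projection to the token dimension before being added to the corresponding image token; that step is omitted here.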

Paper Structure (22 sections, 8 equations, 10 figures, 12 tables)


Figures (10)

  • Figure 2: Overview of Multi-HMR. A ViT backbone extracts image embeddings. Detection is conducted at the patch level with additional 2D offset regression. Each detected token serves as a query for a cross-attention-based head, called the Human Perception Head (HPH), which predicts pose and shape parameters, along with location in 3D space. Optionally, known camera parameters are embedded and added to each patch, represented as a Fourier encoding of the ray originating from the camera center.
  • Figure 3: (a) The token embeddings corresponding to the $N$ detected primary keypoints are used as queries in a series of cross-attention blocks where keys and values correspond to the context provided by all image tokens. MLPs then predict the SMPL-X parameters (pose and shape) as well as the depth for each query. (b) Samples from our CUFFS synthetic dataset.
  • Figure 4: Backbone-resolution-speed trade-off. We report the performance on MuPoTs, CMU and EHF using different backbone sizes and image resolutions. We also report the inference time (right).
  • Figure 5: Randomly sampled qualitative examples: input image and our results overlaid on it. Images from EHF and AGORA (top), MuPoTs and 3DPW (middle), UBody and CMU (bottom). See supplementary material for more visualizations.
  • Figure 6: Samples from our CUFFS dataset with a rendered human using HumGen3D (top) and the corresponding SMPL-X shape used for retargeting (overlaid at the bottom).
  • ...and 5 more figures
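
The detection and cross-attention steps described in the captions of Figures 2 and 3 can be sketched on a toy example. This is a simplified illustration, not the paper's code: it uses a single attention head with identity projections, a fixed detection threshold of 0.5, offsets expressed as fractions of a patch, and made-up token values. The function names `detect_people` and `cross_attend` are hypothetical, and the MLPs that would map each output to SMPL-X parameters and depth are omitted.

```python
import math

def detect_people(heatmap, offsets, patch_size, threshold=0.5):
    """Return (row, col, x, y) for each patch whose score exceeds the threshold."""
    detections = []
    for r, row in enumerate(heatmap):
        for c, score in enumerate(row):
            if score > threshold:
                # Refine the patch centre with the regressed 2D offset
                # (assumed here to be in units of one patch).
                dx, dy = offsets[r][c]
                x = (c + 0.5 + dx) * patch_size
                y = (r + 0.5 + dy) * patch_size
                detections.append((r, c, x, y))
    return detections

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, tokens):
    """Single-head scaled dot-product cross-attention with identity projections."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, tok)) for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    # Weighted sum of all image tokens (keys and values are the same here).
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]

# Toy 2x2 patch grid with 3-dimensional token embeddings (made-up values).
tokens_grid = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
               [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]]
heatmap = [[0.9, 0.1],
           [0.2, 0.8]]
offsets = [[(0.1, -0.2), (0.0, 0.0)],
           [(0.0, 0.0), (-0.3, 0.25)]]

all_tokens = [tok for row in tokens_grid for tok in row]
detections = detect_people(heatmap, offsets, patch_size=14)
# One query per detected person, attending to every image token.
person_features = [cross_attend(tokens_grid[r][c], all_tokens)
                   for r, c, _, _ in detections]
```

Because each detected token attends to the full set of image tokens, the per-person feature can draw on context anywhere in the image, which is what lets the head work without cropping around each person.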