PromptHMR: Promptable Human Mesh Recovery

Yufu Wang, Yu Sun, Priyanka Patel, Kostas Daniilidis, Michael J. Black, Muhammed Kocabas

TL;DR

PromptHMR reframes 3D human pose and shape estimation as a promptable regression problem that leverages full-image context and multimodal side information. By combining a Vision Transformer image backbone, a flexible prompt encoder for spatial and semantic cues, and a transformer SMPL-X decoder with optional interaction-attention, PromptHMR achieves state-of-the-art accuracy on both image- and video-based HPS benchmarks. The approach supports various prompts (boxes, masks, text descriptions, interaction labels) and can handle crowds, occlusions, and person-person interactions while maintaining coherent camera-space and world-space reconstructions. PromptHMR-Vid extends the framework to video with a temporal transformer and world-coordinate motion via SLAM-depth integration, yielding robust, temporally coherent motion estimates suitable for in-the-wild analysis and potential robotics or AR applications.

Abstract

Human pose and shape (HPS) estimation presents challenges in diverse scenarios such as crowded scenes, person-person interactions, and single-view reconstruction. Existing approaches lack mechanisms to incorporate auxiliary "side information" that could enhance reconstruction accuracy in such challenging scenarios. Furthermore, the most accurate methods rely on cropped person detections and cannot exploit scene context while methods that process the whole image often fail to detect people and are less accurate than methods that use crops. While recent language-based methods explore HPS reasoning through large language or vision-language models, their metric accuracy is well below the state of the art. In contrast, we present PromptHMR, a transformer-based promptable method that reformulates HPS estimation through spatial and semantic prompts. Our method processes full images to maintain scene context and accepts multiple input modalities: spatial prompts like bounding boxes and masks, and semantic prompts like language descriptions or interaction labels. PromptHMR demonstrates robust performance across challenging scenarios: estimating people from bounding boxes as small as faces in crowded scenes, improving body shape estimation through language descriptions, modeling person-person interactions, and producing temporally coherent motions in videos. Experiments on benchmarks show that PromptHMR achieves state-of-the-art performance while offering flexible prompt-based control over the HPS estimation process.

Paper Structure

This paper contains 21 sections, 9 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: PromptHMR is a promptable human pose and shape (HPS) estimation method that processes images with spatial or semantic prompts. It takes "side information" readily available from vision-language models or user input to improve the accuracy and robustness of 3D HPS. PromptHMR recovers human pose and shape from spatial prompts such as (a) face bounding boxes, (b) partial or complete person detection boxes, or (c) segmentation masks. It refines its predictions using semantic prompts such as (c) person-person interaction labels for close contact scenarios, or (d) natural language descriptions of body shape to improve body shape predictions. Both image and video versions of PromptHMR achieve state-of-the-art accuracy.
  • Figure 2: Method overview. PromptHMR estimates SMPL-X parameters for each person in an image based on various types of prompts, such as boxes, language descriptions, and person-person interaction cues. Given an image and prompts, we utilize a vision transformer to generate image embeddings and mask and prompt encoders to map different types of prompts to tokens. Optionally, camera intrinsics can be embedded along with the image embeddings. The image embeddings and prompt tokens are then fed to the SMPL-X decoder. The SMPL-X decoder is a transformer-based module that attends to both the image and prompt tokens to estimate SMPL-X parameters. Note that the language and interaction prompts are optional, but providing them enhances the accuracy of the estimated SMPL-X parameters.
  • Figure 3: SMPL-X decoder. The top row shows one attention block in the decoder. The cross-person interaction module can be turned on/off. The bottom row shows the cross-person attention.
  • Figure 4: Effect of box prompts. Our method remains stable with different boxes, including noisy truncated boxes.
  • Figure 5: Effect of mask prompts. Results are from the same model with different prompt inputs. Masks are better for close interaction scenarios where boxes are ambiguous.
  • ...and 4 more figures
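
The captions for Figures 2 and 3 describe a decoder in which per-person tokens attend to shared image embeddings, with a cross-person attention step that can be switched on or off. The following is a minimal NumPy sketch of that idea only, not the authors' implementation. It assumes a single attention head, no learned projections, no layer normalization or MLP, and one query token per prompted person; the names `attention`, `decoder_block`, and `interaction_on` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query, key, value):
    """Single-head scaled dot-product attention (no learned weights)."""
    scale = np.sqrt(query.shape[-1])
    weights = softmax(query @ key.swapaxes(-1, -2) / scale, axis=-1)
    return weights @ value


def decoder_block(person_tokens, image_tokens, interaction_on):
    """One simplified decoder step (an assumption, not the paper's exact block).

    person_tokens: (P, D) array, one token per prompted person.
    image_tokens:  (N, D) array, embeddings of the full image.
    interaction_on: if True, person tokens also attend to each other.
    """
    # Each person token reads from the shared full-image embeddings.
    x = person_tokens + attention(person_tokens, image_tokens, image_tokens)
    if interaction_on:
        # Cross-person attention: every person token attends to all person tokens.
        x = x + attention(x, x, x)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.normal(size=(16, 8))   # 16 image tokens, dimension 8
    people = rng.normal(size=(2, 8))   # 2 prompted people
    print(decoder_block(people, image, interaction_on=False).shape)
    print(decoder_block(people, image, interaction_on=True).shape)
```

In this sketch, with `interaction_on=False` each person's output depends only on that person's token and the image, so people are decoded independently. With `interaction_on=True`, changing one person's token also changes the other's output, which is the kind of coupling a close-contact scenario would need.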