SAT-HMR: Real-Time Multi-Person 3D Mesh Estimation via Scale-Adaptive Tokens

Chi Su, Xiaoxuan Ma, Jiajun Su, Yizhou Wang

TL;DR

SAT-HMR introduces scale-adaptive tokens to efficiently encode features for real-time, one-stage multi-person 3D mesh estimation from a single RGB image. By predicting a patch-level scale map and selectively upgrading small-scale regions to high resolution while pooling background tokens, the method preserves high-resolution accuracy for challenging cases without incurring prohibitive computation. The approach operates within a DETR-style encoder–decoder framework and uses a mixture of low- and high-resolution tokens, yielding real-time performance (~24 FPS) with competitive accuracy on benchmarks like AGORA and 3DPW, and strong generalization to unseen imagery. This work demonstrates a practical pathway to balance accuracy and efficiency in dense 3D human mesh estimation, with potential applicability to other DETR-based vision tasks.

Abstract

We propose a one-stage framework for real-time multi-person 3D human mesh estimation from a single RGB image. Current one-stage methods, which follow a DETR-style pipeline, achieve state-of-the-art (SOTA) performance with high-resolution inputs. We observe that high resolution particularly benefits the estimation of individuals at smaller scales in the image (e.g., those far from the camera), but at the cost of significantly increased computational overhead. To address this, we introduce scale-adaptive tokens that are dynamically adjusted based on the relative scale of each individual in the image within the DETR framework. Specifically, individuals at smaller scales are processed at higher resolutions, larger ones at lower resolutions, and background regions are further distilled. These scale-adaptive tokens encode the image features more efficiently, facilitating subsequent decoding to regress the human mesh, while allowing the model to allocate computational resources more effectively and focus on more challenging cases. Experiments show that our method preserves the accuracy benefits of high-resolution processing while substantially reducing computational cost, achieving real-time inference with performance comparable to SOTA methods.
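To make the cost argument concrete, the following is a rough token-count sketch rather than the authors' accounting. It assumes a ViT-style patch size of 14, a high-resolution image at twice the 644 base resolution (so each low-resolution patch maps to four high-resolution patches), and a purely hypothetical image in which 5% of patches are small-scale people and 80% are background, with every four background tokens pooled into one. The function names and all of these numbers are illustrative assumptions.

```python
def grid_tokens(resolution: int, patch: int = 14) -> int:
    """Number of patch tokens for a square image (patch size 14 is an assumption)."""
    side = resolution // patch
    return side * side


def mixed_tokens(n_lr: int, small_frac: float, bg_frac: float, bg_pool: int = 4) -> int:
    """Token count after upgrading small-scale patches and pooling background.

    Assumptions: each small-scale low-res patch becomes 4 high-res patches
    (2x resolution), and every `bg_pool` background tokens merge into one.
    """
    n_small = round(n_lr * small_frac)
    n_bg = round(n_lr * bg_frac)
    n_large = n_lr - n_small - n_bg
    pooled_bg = -(-n_bg // bg_pool)  # ceiling division
    return n_large + 4 * n_small + pooled_bg


n_lr = grid_tokens(644)     # uniform low-resolution grid
n_hr = grid_tokens(1288)    # uniform high-resolution grid
n_sa = mixed_tokens(n_lr, small_frac=0.05, bg_frac=0.80)  # hypothetical image

print(n_lr, n_hr, n_sa)
```

Because self-attention cost grows roughly quadratically with the number of tokens, a mixed token set that is several times smaller than the uniform high-resolution grid can cut encoder cost by a larger factor than the token ratio alone suggests. The actual savings depend on image content.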

Paper Structure

This paper contains 25 sections, 3 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: (a) We propose scale-adaptive tokens in our one-stage framework for real-time multi-person 3D mesh estimation. The tokens are dynamically adjusted based on the relative size of individuals in the image to encode features more efficiently, enabling real-time and accurate multi-person mesh estimation. We present a conceptual visualization of the scale-adaptive tokens. The right column shows the predicted meshes projected onto an image from the 3DPW dataset [vonMarcard2018] and rendered from an elevated view. (b) Comparison of estimation error and inference time across different methods, with input resolutions in parentheses. Our method, using a mixed resolution with a base resolution of 644, achieves performance comparable to state-of-the-art methods on the AGORA [patel2021agora] test set while maintaining real-time inference efficiency. Code and models are available at \projpage.
  • Figure 2: Estimation errors and FPS of baselines with different resolutions and our method across individuals at various scales. The scale of an individual refers to the person's size relative to the overall image; see \ref{subsec:scaleaware} for the mathematical definition. The colored lines show the MVE (mean vertex error, left y-axis) of the baselines with different resolutions (Res.) on the AGORA [patel2021agora] validation set. The colored markers on the right y-axis indicate the FPS of the corresponding models. Our method adopts a mixed resolution with a base resolution of 644.
  • Figure 3: Overview of (top) the baseline method and (bottom) our method with scale-adaptive tokens. Top: Our baseline method adopts a DETR-style [carion2020end] pipeline consisting of a Transformer encoder, decoder, and prediction heads for regressing SMPL parameters. Bottom: Our method focuses on efficient feature encoding using scale-adaptive tokens. Specifically, low-resolution and high-resolution patches are extracted from the input images $\mathbf{I}$ and $\mathbf{I}_\text{hr}$, respectively. A scale head network predicts a patch-level scale map $\mathbf{S}$ from the low-resolution tokens, classifying them into three categories: background, small-scale, and large-scale. This scale map guides the pruning and pooling of low-resolution tokens $\mathcal{T}_\text{LR}$ and indicates which patches should be replaced by high-resolution ones. By concatenating the pooled background tokens $\mathcal{T}'_\text{B}$, the remaining large-scale low-resolution tokens $\mathcal{T}_\text{LARGE}$, and the high-resolution tokens $\mathcal{T}_\text{HR}$, we obtain scale-adaptive tokens $\mathcal{T}_\text{SA}$. These tokens are then processed by the encoder, decoder, and multiple prediction heads to regress the human mesh.
  • Figure 4: Comparison with SOTA methods [sun2021monocular, sun2022putting, sun2024aios, multihmr2024] on in-the-wild images from the Internet. Red dashed circles highlight areas with incorrect estimations. The third case is left blank due to the small scale of individuals. Please zoom in for details.
  • Figure 5: Qualitative comparison of different resolutions for our baseline. Resolution for the baseline: (a) 518, (b) 1288. Red dashed circles highlight differences; zoom in for details.
  • ...and 7 more figures
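
The token assembly described in the Figure 3 caption can be sketched in a few lines of plain Python. This is an illustrative toy, not the authors' implementation. It assumes the scale map is already given (the scale head is not modeled), that each low-resolution patch at grid position (r, c) is covered by the 2x2 block of high-resolution patches starting at (2r, 2c), and that background tokens are mean-pooled in raster-order groups of four. The names `build_scale_adaptive_tokens`, `mean_pool`, and `bg_group` are hypothetical, and positional embeddings are omitted.

```python
from typing import Dict, List, Tuple

BACKGROUND, SMALL, LARGE = 0, 1, 2  # the three scale-map categories
Token = List[float]
Grid = Dict[Tuple[int, int], Token]


def mean_pool(tokens: List[Token]) -> Token:
    """Average a non-empty group of equal-length feature vectors."""
    n = len(tokens)
    return [sum(vals) / n for vals in zip(*tokens)]


def build_scale_adaptive_tokens(
    lr_tokens: Grid,
    hr_tokens: Grid,
    scale_map: Dict[Tuple[int, int], int],
    bg_group: int = 4,  # assumed pooling group size
) -> List[Token]:
    """Concatenate pooled background, large-scale LR, and HR tokens."""
    background, large, high_res = [], [], []
    for (r, c), tok in sorted(lr_tokens.items()):
        label = scale_map[(r, c)]
        if label == BACKGROUND:
            background.append(tok)
        elif label == LARGE:
            large.append(tok)
        else:
            # Small-scale patch: swap in the 2x2 high-res patches covering it.
            for dr in (0, 1):
                for dc in (0, 1):
                    high_res.append(hr_tokens[(2 * r + dr, 2 * c + dc)])
    pooled_bg = [
        mean_pool(background[i:i + bg_group])
        for i in range(0, len(background), bg_group)
    ]
    return pooled_bg + large + high_res


# Toy example: a 2x2 low-res grid and the matching 4x4 high-res grid.
lr = {(r, c): [float(r), float(c)] for r in range(2) for c in range(2)}
hr = {(r, c): [float(r), float(c)] for r in range(4) for c in range(4)}
smap = {(0, 0): BACKGROUND, (0, 1): BACKGROUND, (1, 0): LARGE, (1, 1): SMALL}

tokens = build_scale_adaptive_tokens(lr, hr, smap)
print(len(tokens))  # 1 pooled background + 1 large-scale + 4 high-res
```

In this toy case four low-resolution tokens become six mixed tokens because one of the four patches is small-scale. On real images, where background usually dominates, pooling is what brings the total well below a uniform high-resolution grid.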