AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation

Qingping Sun, Yanjun Wang, Ailing Zeng, Wanqi Yin, Chen Wei, Wenjia Wang, Haiyi Mei, Chi Sing Leung, Ziwei Liu, Lei Yang, Zhongang Cai

TL;DR

AiOS is a novel all-in-one-stage framework for multi-person expressive human pose and shape recovery that needs no separate human detection step. Built upon DETR, it treats multi-person whole-body mesh recovery as a progressive set prediction problem solved through a sequence of detection steps.
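
To make the set prediction idea concrete: DETR-style models emit a fixed-size set of predictions and pair them one-to-one with ground-truth instances through bipartite matching. The sketch below is only an illustration under stated assumptions. It uses a hypothetical cost (the distance between 2D person centers) and a brute-force search over permutations rather than the Hungarian algorithm, and the function name `match_predictions` is invented here. The matching cost actually used by AiOS is not stated in this summary.

```python
from itertools import permutations
from math import dist


def match_predictions(pred_centers, gt_centers):
    """Find the cheapest one-to-one matching of predictions to ground truth.

    Returns (assignment, total_cost), where assignment[j] is the index of the
    prediction matched to ground-truth person j. Brute force over permutations,
    so this is suitable for toy inputs only.
    """
    n_pred, n_gt = len(pred_centers), len(gt_centers)
    if n_gt > n_pred:
        raise ValueError("need at least as many predictions as ground-truth people")
    best_cost, best_assign = float("inf"), ()
    # Each candidate picks n_gt distinct predictions, one per ground-truth person.
    for assign in permutations(range(n_pred), n_gt):
        cost = sum(dist(pred_centers[i], gt_centers[j]) for j, i in enumerate(assign))
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_assign, best_cost


# Hypothetical normalized image coordinates: three predictions, two people.
preds = [(0.9, 0.1), (0.2, 0.8), (0.5, 0.5)]
gts = [(0.25, 0.75), (0.85, 0.15)]
assignment, cost = match_predictions(preds, gts)
print(assignment)  # (1, 0): prediction 1 -> person 0, prediction 0 -> person 1
```

Prediction 2 stays unmatched, which in a DETR-style setup would be supervised as "no person". This is what lets the model skip a separate detector and non-maximum suppression.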

Abstract

Expressive human pose and shape estimation (a.k.a. 3D whole-body mesh recovery) involves estimating the human body, hands, and facial expression. Most existing methods tackle this task in a two-stage manner, first detecting the human body parts with an off-the-shelf detection model and then inferring the different body parts individually. Despite the impressive results achieved, these methods suffer from 1) loss of valuable contextual information via cropping, 2) the introduction of distractions, and 3) a lack of inter-association among different persons and body parts, which inevitably degrades performance, especially in crowded scenes. To address these issues, we introduce a novel all-in-one-stage framework, AiOS, for multi-person expressive human pose and shape recovery without an additional human detection step. Our method is built upon DETR and treats the multi-person whole-body mesh recovery task as a progressive set prediction problem solved through a sequence of detection steps. We devise the decoder tokens and extend them to our task. Specifically, we first employ a human token to probe a human location in the image and encode global features for each instance, which provides a coarse location for the later transformer block. Then, we introduce a joint-related token to probe the human joints in the image and encode fine-grained local features, which collaborate with the global features to regress the whole-body mesh. This straightforward but effective model outperforms previous state-of-the-art methods, with a 9% reduction in NMVE on AGORA, a 30% reduction in PVE on EHF, a 10% reduction in PVE on ARCTIC, and a 3% reduction in PVE on EgoBody.
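
The reported metrics can be read as follows. PVE (per-vertex error) is commonly computed as the mean Euclidean distance between predicted and ground-truth mesh vertices, and the AGORA benchmark's NMVE divides the mean vertex error by the detection F1 score so that missed people are penalized. The sketch below is a minimal illustration, not the paper's evaluation code. It assumes root (pelvis) alignment before measuring, reads the stated percentage reductions as relative reductions, and uses hypothetical function names and numbers throughout.

```python
from math import dist


def per_vertex_error(pred_vertices, gt_vertices, pred_root, gt_root):
    """Mean Euclidean vertex distance after subtracting each mesh's root position."""
    if len(pred_vertices) != len(gt_vertices) or not pred_vertices:
        raise ValueError("meshes must be non-empty and have the same vertex count")
    total = 0.0
    for p, g in zip(pred_vertices, gt_vertices):
        p_aligned = [pc - rc for pc, rc in zip(p, pred_root)]
        g_aligned = [gc - rc for gc, rc in zip(g, gt_root)]
        total += dist(p_aligned, g_aligned)
    return total / len(pred_vertices)


def normalized_error(mean_error, f1_score):
    """AGORA-style normalization: divide the error by the detection F1 score."""
    if not 0.0 < f1_score <= 1.0:
        raise ValueError("F1 score must be in (0, 1]")
    return mean_error / f1_score


def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical two-vertex "mesh" in millimetres; the prediction is shifted by
# 10 mm overall (removed by root alignment) and off by 30 mm at one vertex.
gt = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
pred = [(10.0, 0.0, 0.0), (110.0, 0.0, 30.0)]
pve = per_vertex_error(pred, gt, pred_root=(10.0, 0.0, 0.0), gt_root=(0.0, 0.0, 0.0))
print(pve)                           # 15.0
print(normalized_error(pve, 0.75))   # 20.0
print(relative_reduction(50.0, 35.0))  # 0.3, i.e. a 30% reduction
```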

Paper Structure

This paper contains 23 sections, 9 figures, 10 tables.

Figures (9)

  • Figure 1: A comparison of existing methods in EHPS. (a) Top-down, multi-stage methods typically use detectors to detect humans, then use different networks to regress body parts on cropped images. (b) Top-down, one-stage methods use only one network for regression but still require detectors and rely on cropped images. (c) Our all-in-one-stage pipeline performs end-to-end human detection and regression on the full frame.
  • Figure 2: Pipeline overview. AiOS performs human localization and SMPL-X estimation in a progressive manner. It is composed of (1) the Body Localization stage, which predicts coarse human locations; (2) the Body Refinement stage, which refines body features and produces face and hand locations; (3) the Whole-body Refinement stage, which refines whole-body features and regresses SMPL-X parameters.
  • Figure 3: Comparison of current SOTA methods (SMPLer-X, Hand4Whole, OSX) with our AiOS model. The upper part shows visualization results on AGORA, and the lower part shows results on the EHF test set.
  • Figure 4: Visual comparisons with SOTA one-stage HPS methods (ROMP, BEV) on Internet data.
  • Figure 5: Attention Visualization. The green dots represent the location of the reference point, and the red dots are the sampling points.
  • ...and 4 more figures