SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation

Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Yanjun Wang, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Lei Yang, Ziwei Liu

TL;DR

This work addresses the generalization gap in expressive human pose and shape estimation (EHPS) by scaling both data and model capacity. It introduces SMPLer-X, a ViT-based foundation model trained on up to 4.5M instances drawn from 32 EHPS datasets, which achieves strong cross-domain performance and state-of-the-art results on seven benchmarks. The work also presents the first systematic EHPS benchmark, which reveals dataset inconsistencies and domain gaps and guides data selection, including the value of synthetic data and pseudo-SMPL-X labels. The authors further show that finetuning the foundation model into domain-specific specialists yields additional state-of-the-art gains, providing a plug-and-play backbone and practical guidelines for rapid adaptation in EHPS applications.

Abstract

Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods still depend largely on a confined set of training datasets. In this work, we investigate scaling up EHPS towards the first generalist foundation model (dubbed SMPLer-X), with up to ViT-Huge as the backbone and training with up to 4.5M instances from diverse data sources. With big data and the large model, SMPLer-X exhibits strong performance across diverse test benchmarks and excellent transferability to even unseen environments. 1) For the data scaling, we perform a systematic investigation on 32 EHPS datasets, including a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. 2) For the model scaling, we take advantage of vision transformers to study the scaling law of model sizes in EHPS. Moreover, our finetuning strategy turns SMPLer-X into specialist models, allowing them to achieve further performance boosts. Notably, our foundation model SMPLer-X consistently delivers state-of-the-art results on seven benchmarks such as AGORA (107.2 mm NMVE), UBody (57.4 mm PVE), EgoBody (63.6 mm PVE), and EHF (62.3 mm PVE without finetuning). Homepage: https://caizhongang.github.io/projects/SMPLer-X/
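The abstract reports errors in millimetres as PVE and NMVE without defining them. The sketch below illustrates how such metrics are commonly computed. It assumes that PVE is the mean Euclidean distance between corresponding predicted and ground-truth mesh vertices after root alignment, and that NMVE divides a mean vertex error by the detection F1 score, as in the AGORA convention. The function names, the metre-valued inputs, and the choice of root point are hypothetical and are not taken from the SMPLer-X codebase.

```python
import math
from typing import Sequence

Point = Sequence[float]  # (x, y, z) in metres; an assumed input convention


def per_vertex_error_mm(
    pred_vertices: Sequence[Point],
    gt_vertices: Sequence[Point],
    pred_root: Point,
    gt_root: Point,
) -> float:
    """Mean per-vertex Euclidean error in millimetres.

    Each mesh is first translated so that its root point (for example a
    pelvis joint, which is an assumption here) sits at the origin.
    """
    if not pred_vertices or len(pred_vertices) != len(gt_vertices):
        raise ValueError("meshes must be non-empty and have the same vertex count")

    total = 0.0
    for p, g in zip(pred_vertices, gt_vertices):
        # Root-relative coordinates remove any global translation offset.
        p_rel = [pi - ri for pi, ri in zip(p, pred_root)]
        g_rel = [gi - ri for gi, ri in zip(g, gt_root)]
        total += math.dist(p_rel, g_rel)

    # Convert the mean distance from metres to millimetres.
    return 1000.0 * total / len(pred_vertices)


def normalized_mean_vertex_error(mve_mm: float, f1_score: float) -> float:
    """Mean vertex error divided by detection F1, so missed people are penalized."""
    if not 0.0 < f1_score <= 1.0:
        raise ValueError("f1_score must be in (0, 1]")
    return mve_mm / f1_score


if __name__ == "__main__":
    pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.1)]
    origin = (0.0, 0.0, 0.0)
    print(per_vertex_error_mm(pred, gt, origin, origin))
    print(normalized_mean_vertex_error(100.0, 0.8))
```

Because both meshes are expressed relative to their own root, shifting the whole prediction by a constant offset leaves the error unchanged; only differences in pose and shape contribute.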

Paper Structure (28 sections, 1 equation, 14 figures, 22 tables)

Figures (14)

  • Figure 1: Scaling up EHPS. Both data and model scaling are effective in reducing mean errors on primary metrics across key benchmarks: AGORA [patel2021agora], UBody [lin2023one], EgoBody [zhang2022egobody], 3DPW [von2018recovering] and EHF [pavlakos2019expressive]. OSX [lin2023one] and H4W [GyeongsikMoon2020hand4whole] are SOTA methods. The area of each circle indicates model size, with ViT variants as the reference (top right).
  • Figure 2: Dataset attribute distributions. a) and d) are image features extracted by HumanBench [tang2023humanbench] and OSX [lin2023one] pretrained ViT-L backbones. b) Global orientation (represented by rotation matrix) distribution. c) Body pose (represented by 3D skeleton joints) distribution. Both e) scenes and f) Real/Synthetic are drawn on the same distribution as d). All: all datasets. UMAP [mcinnes2018umap] dimension reduction is used, with the x- and y-axes as the dimensions of the embedded space (no unit).
  • Figure 3: Analysis of dataset attributes. We study the impact of a) the number of training instances, b) scenes, c) real or synthetic appearance, and d) annotation type, on dataset ranking in Table \ref{tab:single_datasets}.
  • Figure 4: Architecture of SMPLer-X, which upholds the idea that "simplicity is beauty". SMPLer-X contains a backbone that allows for easy investigation of model scaling, a neck for hand and face feature cropping, and heads for different body parts. Note that we wish to show in this work that model and data scaling are effective, even with a straightforward architecture.
  • Figure 5: Visualization. We compare SMPLer-X-L32 with OSX [lin2023one] and Hand4Whole [GyeongsikMoon2020hand4whole] (trained with MSCOCO, MPII, and Human3.6M) in various scenarios such as those with heavy truncation, hard poses, and rare camera angles.
  • ...and 9 more figures