SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation

Wanqi Yin, Zhongang Cai, Ruisi Wang, Ailing Zeng, Chen Wei, Qingping Sun, Haiyi Mei, Yanjun Wang, Hui En Pang, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Atsushi Yamashita, Lei Yang, Ziwei Liu

TL;DR

The paper addresses the generalization gap in expressive human pose and shape estimation (EHPS) by assembling 40 diverse datasets and studying data and model scaling with two minimalist foundation models, SMPLer-X and SMPLest-X. It demonstrates that large-scale data and ViT-based backbones yield strong cross-domain performance and transfer to unseen environments, and that finetuning turns the generalists into domain-specific specialists that reach state-of-the-art results on multiple benchmarks. It also introduces SynHand for focused hand evaluation and the MPE metric for holistic generalization. Key findings show diminishing returns beyond about 10 million training instances within a fixed data domain, highlighting the need for algorithmic innovations alongside data expansion. The work provides a practical benchmark, scalable and reusable foundation models, and a pathway to rapid adaptation via benchmark-guided finetuning for robust in-the-wild EHPS applications.

Abstract

Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods focus on training innovative architectural designs on confined datasets. In this work, we investigate the impact of scaling up EHPS towards a family of generalist foundation models. 1) For data scaling, we perform a systematic investigation on 40 EHPS datasets, encompassing a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. Ultimately, we achieve diminishing returns at 10M training instances from diverse data sources. 2) For model scaling, we take advantage of vision transformers (up to ViT-Huge as the backbone) to study the scaling law of model sizes in EHPS. To exclude the influence of algorithmic design, we base our experiments on two minimalist architectures: SMPLer-X, which consists of an intermediate step for hand and face localization, and SMPLest-X, an even simpler version that reduces the network to its bare essentials and highlights significant advances in the capture of articulated hands. With big data and the large model, the foundation models exhibit strong performance across diverse test benchmarks and excellent transferability to even unseen environments. Moreover, our finetuning strategy turns the generalist into specialist models, allowing them to achieve further performance boosts. Notably, our foundation models consistently deliver state-of-the-art results on seven benchmarks such as AGORA, UBody, EgoBody, and our proposed SynHand dataset for comprehensive hand evaluation. (Code is available at: https://github.com/wqyin/SMPLest-X).

Paper Structure (33 sections, 1 equation, 19 figures, 24 tables)

This paper contains 33 sections, 1 equation, 19 figures, 24 tables.

Figures (19)

  • Figure 1: Scaling up EHPS. a) Whole-body and b) hand-only mean primary error (MPE) indicate that both data and model scaling are effective in reducing mean errors on primary metrics across key benchmarks: AGORA [patel2021agora], UBody [lin2023one], EgoBody [zhang2022egobody], 3DPW [von2018recovering] and EHF [Pavlakos_2019smplx]. OSX [lin2023one] and HybrIK-X [li2023hybrik] are SOTA methods. The area of each circle indicates model size, with ViT variants as the reference (top right in the left figure).
  • Figure 2: Dataset attribute distributions. a) and d) are image features extracted by the HumanBench [tang2023humanbench] and OSX [lin2023one] pretrained ViT-L backbones. b) Global orientation (represented by rotation matrix) distribution. c) Body pose (represented by 3D skeleton joints) distribution. Both e) scenes and f) Real/Synthetic are drawn on the same distribution as d). All: all datasets. UMAP [mcinnes2018umap] dimension reduction is used in all visualizations, with the x- and y-axes as the dimensions of the embedded space.
  • Figure 3: Analysis of hand poses in whole-body datasets. The distribution of the distance to the relaxed hand pose of each dataset (top) is shown along with the illustration of pose complexity at various distances (bottom). The hand pose with a lower distance is more similar to the relaxed pose.
  • Figure 4: Visualization of SynHand dataset with complex hand poses and accurate annotations.
  • Figure 5: Architecture of SMPLest-X. Compared with other frameworks that use algorithmic modules in various stages (bottom), SMPLest-X (top) has a minimalistic framework design in all three stages. Note that SMPLer-X includes the B. Component Guiding module in the decoder stage.
  • ...and 14 more figures
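
The Figure 1 caption describes MPE as the mean error on primary metrics across key benchmarks. A minimal sketch of that aggregation follows, assuming an unweighted average over one primary-metric error per benchmark. The function name, the dictionary layout, and the numbers are hypothetical placeholders for illustration, not values or code from the paper; which metric counts as "primary" for each benchmark is not specified in this summary.

```python
from statistics import mean


def mean_primary_error(primary_errors: dict[str, float]) -> float:
    """Unweighted mean of one primary-metric error per benchmark.

    Assumption: every benchmark contributes equally. The summary does not
    say whether the paper applies any weighting.
    """
    if not primary_errors:
        raise ValueError("need at least one benchmark error")
    return mean(primary_errors.values())


# Hypothetical placeholder errors (NOT results reported in the paper).
example_errors = {
    "AGORA": 100.0,
    "UBody": 60.0,
    "EgoBody": 80.0,
    "3DPW": 90.0,
    "EHF": 70.0,
}

mpe = mean_primary_error(example_errors)
print(f"MPE over {len(example_errors)} benchmarks: {mpe:.1f}")
```

Collapsing several benchmarks into a single number makes it easy to compare models trained with different data or backbone sizes, at the cost of hiding per-benchmark differences.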