Playing for 3D Human Recovery

Zhongang Cai, Mingyuan Zhang, Jiawei Ren, Chen Wei, Daxuan Ren, Zhengyu Lin, Haiyu Zhao, Lei Yang, Chen Change Loy, Ziwei Liu

TL;DR

The paper introduces GTA-Human, a large-scale synthetic dataset with 1.4 million SMPL annotations from GTA-V to advance 3D human recovery. It demonstrates that mixing synthetic GTA-Human data with real data improves both image- and video-based methods, sometimes outperforming much more complex baselines, and highlights domain-gap dynamics and the complementary value of synthetic data. The study shows dataset scale, strong SMPL supervision, and backbone capacity all amplify gains, with deeper models benefiting most from large synthetic corpora. The work argues that game-playing data offers a scalable, cost-effective path toward robust 3D human pose and shape estimation in the wild, and outlines practical guidelines for data mixture, domain adaptation, and model design.

Abstract

Image- and video-based 3D human recovery (i.e., pose and shape estimation) has achieved substantial progress. However, due to the prohibitive cost of motion capture, existing datasets are often limited in scale and diversity. In this work, we obtain massive human sequences by playing a video game with automatically annotated 3D ground truths. Specifically, we contribute GTA-Human, a large-scale 3D human dataset generated with the GTA-V game engine, featuring a highly diverse set of subjects, actions, and scenarios. More importantly, we study the use of game-playing data and obtain five major insights. First, game-playing data is surprisingly effective. A simple frame-based baseline trained on GTA-Human outperforms more sophisticated methods by a large margin. For video-based methods, GTA-Human is even on par with the in-domain training set. Second, we discover that synthetic data provides critical complements to the real data that is typically collected indoors. Our investigation into the domain gap provides explanations for our data mixture strategies, which are simple yet useful. Third, the scale of the dataset matters. The performance boost is closely related to the additional data available. A systematic study reveals the model sensitivity to data density from multiple key aspects. Fourth, the effectiveness of GTA-Human is also attributed to the rich collection of strong supervision labels (SMPL parameters), which are otherwise expensive to acquire in real datasets. Fifth, the benefits of synthetic data extend to larger models such as deeper convolutional neural networks (CNNs) and Transformers, for which a significant impact is also observed. We hope our work could pave the way for scaling up 3D human recovery to the real world. Homepage: https://caizhongang.github.io/projects/GTA-Human/
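
The abstract refers to data mixture strategies without detailing them here. As a rough illustration only, the sketch below builds training batches that combine real and synthetic samples at a fixed ratio. The function name `mixed_batches`, the `synthetic_ratio` parameter, and the sampling-with-replacement scheme are assumptions made for this example and are not taken from the paper.

```python
import random


def mixed_batches(real, synthetic, batch_size, synthetic_ratio, num_batches, seed=0):
    """Yield batches mixing real and synthetic samples (hypothetical helper).

    synthetic_ratio is the assumed fraction of each batch drawn from the
    synthetic pool; samples are drawn with replacement from each pool.
    """
    if not 0.0 <= synthetic_ratio <= 1.0:
        raise ValueError("synthetic_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    n_syn = round(batch_size * synthetic_ratio)
    n_real = batch_size - n_syn
    for _ in range(num_batches):
        batch = rng.choices(real, k=n_real) + rng.choices(synthetic, k=n_syn)
        rng.shuffle(batch)  # avoid a fixed real/synthetic ordering within a batch
        yield batch


# Toy usage with placeholder sample identifiers.
real_pool = ["real_0", "real_1"]
synthetic_pool = ["syn_0", "syn_1", "syn_2"]
batches = list(mixed_batches(real_pool, synthetic_pool, batch_size=4,
                             synthetic_ratio=0.5, num_batches=3))
```

Fixing the ratio per batch keeps the proportion of synthetic data constant even when the two pools differ greatly in size, which is one simple way to control the mixture.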

Paper Structure

This paper contains 26 sections, 2 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: The GTA-Human dataset is built from GTA-V, an open-world action game that features a reasonably realistic functioning metropolis and the virtual characters living in it. Our customized toolchain enables large-scale collection and annotation of highly diverse human data that we hope will aid in-depth studies on 3D human recovery. We show here a few examples with SMPL annotations overlaid on the virtual humans.
  • Figure 2: Data collection toolchain. Our toolchain is highly scalable, as cloud services are used to coordinate a large number of computation workers. Left: an overview of the pipeline. Top right: a detailed illustration of the Local GUI Worker. Bottom right: a detailed illustration of the Cluster Worker.
  • Figure 3: Data diversity in GTA-Human. (a) GTA-Human contains subjects of varied genders, ages, skin tones, clothing and body shapes. (b) Locations with diverse backgrounds. The example locations are pinpointed on the 3D game world map. We discover in Section \ref{sec:experiments:the_unreasonable_effectiveness_of_data} that the outdoor scenes are critical to the usefulness of GTA-Human. (c) Different weather conditions. (d) In-game time is set to capture diverse lighting conditions. We capture the same scene at one-game-hour intervals. Note that the shadow direction is affected by the sun's position.
  • Figure 4: Actions. GTA-Human contains 20 thousand actions that are expressive and diverse. (a) The distributions of poses in GTA-Human and real datasets are visualized after PCA dimension reduction. (b) We show five pose sequences, represented by curves. Representative frames of sequences 1-5 are indicated by the diamond-shaped nodes. Datasets are downsampled proportionally.
  • Figure 5: Camera angles. (a) Visualization of camera angles sampled from various datasets, normalized to a unit sphere. (b) Elevation angle distributions (up-down, with a positive value indicating a camera placed higher than the waist and looking down). The vertical axis represents normalized data density. The colors of the points in (a) and of the line plots in (b) represent different datasets, as shown in the legend in (b).
  • ...and 6 more figures
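
The camera-angle convention in the Figure 5 caption (directions normalized to a unit sphere, positive elevation when the camera sits above the waist and looks down) can be sketched as follows. This is a minimal illustration, assuming a z-up world coordinate frame and that the subject's waist is given as a 3D point; the function name `camera_direction_and_elevation` and its arguments are hypothetical and are not part of the paper's toolchain.

```python
import math


def camera_direction_and_elevation(camera_pos, waist_pos):
    """Return (unit direction from waist to camera, elevation in degrees).

    Assumes a z-up world frame (an assumption of this sketch). Elevation is
    positive when the camera is higher than the waist.
    """
    dx = camera_pos[0] - waist_pos[0]
    dy = camera_pos[1] - waist_pos[1]
    dz = camera_pos[2] - waist_pos[2]
    dist = math.sqrt(dx * dx + dy * dy + dz * dz)
    if dist == 0.0:
        raise ValueError("camera and waist positions coincide")
    # Normalizing the offset places the camera on a unit sphere around the subject.
    unit = (dx / dist, dy / dist, dz / dist)
    # Angle between the offset vector and the horizontal (x-y) plane.
    elevation_deg = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return unit, elevation_deg


# Camera one unit in front of and one unit above the waist.
unit_above, elev_above = camera_direction_and_elevation((1.0, 0.0, 1.0), (0.0, 0.0, 0.0))
# Camera one unit in front of and one unit below the waist.
unit_below, elev_below = camera_direction_and_elevation((1.0, 0.0, -1.0), (0.0, 0.0, 0.0))
```

Using `atan2` against the horizontal distance keeps the sign convention explicit: cameras above the waist yield positive elevation and cameras below it yield negative elevation.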