
WildAvatar: Learning In-the-wild 3D Avatars from the Web

Zihao Huang, Shoukang Hu, Guangcong Wang, Tianqi Liu, Yuhang Zang, Zhiguo Cao, Wei Li, Ziwei Liu

TL;DR

This work tackles the scarcity of real-world 3D avatar data by introducing a fully automated web-video annotation pipeline with robust filtering to mine in-the-wild human motions from YouTube. The resulting WildAvatar dataset contains over 10k subjects and scenes, significantly expanding diversity in pose, viewpoint, and clothing without specialized equipment. Empirical results show the pipeline achieves state-of-the-art SMPL annotations on EMDB, improves verification on web videos, and enhances both per-subject and generalizable avatar methods when trained on WildAvatar, with notable gains in PSNR, SSIM, and LPIPS. By enabling large-scale, real-world avatar data and releasing code and data, the work aims to advance practical 3D/4D avatar creation and related tasks.

Abstract

Existing research on avatar creation is typically limited to laboratory datasets, which are costly to scale and insufficiently representative of the real world. On the other hand, the web abounds with off-the-shelf real-world human videos, but these videos vary in quality and require accurate annotations for avatar creation. To this end, we propose an automatic annotation pipeline with filtering protocols to curate such human videos from the web. Our pipeline surpasses state-of-the-art methods on the EMDB benchmark, and the filtering protocols boost verification metrics on web videos. We then curate WildAvatar, a web-scale in-the-wild human avatar creation dataset extracted from YouTube, with $10000+$ different human subjects and scenes. WildAvatar is at least $10\times$ richer than previous datasets for 3D human avatar creation and closer to the real world. To explore its potential, we demonstrate the quality and generalizability of avatar creation methods on WildAvatar. We will publicly release our code, data source links, and annotations to push forward 3D human avatar creation and other related fields for real-world applications.

Paper Structure (17 sections, 4 equations, 12 figures, 5 tables)

Figures (12)

  • Figure 1: Overview of WildAvatar. (a) Unlike previous laboratory datasets for 3D avatar creation, WildAvatar curates in-the-wild web videos. (b) With $10k+$ human subjects and scenes, WildAvatar is at least $10\times$ richer than the previous datasets. (c) It contains high-quality annotations and demonstrates impressive potential to boost the quality and generalizability of avatar-creation methods.
  • Figure 2: The four-stage data processing pipeline. We first obtain the bounding box of key subjects in videos in Stage I and extract human segmentation masks in Stage II. Then, the SMPL and camera parameters are coarsely estimated in Stage III and later refined in Stage IV.
  • Figure 3: Visualizations of filtering protocols. We only retain video clips that (a) show high confidence in human detections; (b) obtain high average confidence in 2D pose estimations; (c) receive consistent annotations from different expert models; (d) show consistent keypoints between projected SMPL keypoints and 2D pose estimations; and (e) show consistent segmentation masks between Segment-Anything and SMPL.
  • Figure 4: Data Analysis: (a) word cloud of the video titles in WildAvatar; (b) histograms of annotations across video clips, where the bounding box and human mask regions are counted in pixels and "Range" denotes the difference between the maximum and minimum values; (c) comparison of the body pose spaces with popular laboratory human datasets; (d) comparison of the viewpoint spaces with popular laboratory human datasets; (e) resolutions of videos in WildAvatar; and (f) comparison with the previous dataset on the abundance of clothing. We introduce the SSIOU, the inverse IOU between SMPL masks and segmentation masks.
  • Figure 5: Qualitative comparisons of avatars created with our annotations and the state-of-the-art HMR2.0. IN and GH denote InstantNVR and GauHuman, respectively. More accurate avatars can be created with our annotations. Human faces are blurred.
  • ...and 7 more figures
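
The mask-consistency check in Figure 3(e) and the SSIOU measure in Figure 4 both compare a rendered SMPL mask against a segmentation mask. The following is a minimal sketch of that comparison, not the authors' implementation. It assumes that "inverse IOU" means 1 - IoU (the captions do not define it), and the function names and the `min_mean_iou` threshold are hypothetical.

```python
from typing import Sequence

# A 2D binary mask as nested sequences; any truthy value marks a human pixel.
Mask = Sequence[Sequence[int]]


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two binary masks of identical shape."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("masks must have the same shape")
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            if pa and pb:
                inter += 1
            if pa or pb:
                union += 1
    # Two empty masks are treated as identical.
    return inter / union if union else 1.0


def inverse_iou(smpl_mask: Mask, seg_mask: Mask) -> float:
    """ASSUMPTION: 'inverse IOU' is taken to be 1 - IoU.

    Under this reading, a larger value means more of the silhouette is
    unexplained by the body model (for example, loose clothing).
    """
    return 1.0 - mask_iou(smpl_mask, seg_mask)


def keep_clip(
    smpl_masks: Sequence[Mask],
    seg_masks: Sequence[Mask],
    min_mean_iou: float = 0.8,  # hypothetical threshold, not from the paper
) -> bool:
    """Keep a clip only if the per-frame masks agree on average."""
    ious = [mask_iou(s, m) for s, m in zip(smpl_masks, seg_masks)]
    return bool(ious) and sum(ious) / len(ious) >= min_mean_iou
```

A clip-level filter in this style would call `keep_clip` with one SMPL mask and one segmentation mask per frame. A real pipeline would more likely operate on array masks, and the right threshold depends on how much clothing deviation the dataset is meant to retain.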