VGGHeads: 3D Multi Head Alignment with a Large-Scale Synthetic Dataset

Orest Kupyn, Eugene Khvedchenia, Christian Rupprecht

TL;DR

VGGHeads addresses privacy and generalization gaps in 3D human head modeling by introducing a large-scale synthetic dataset generated with diffusion models and a multi-head 3D mesh reconstruction model. The proposed architecture extends YOLO-NAS to jointly predict head bounding boxes and 3DMM/FLAME parameters, enabling single-pass reconstruction of multiple heads from RGB images. Experiments demonstrate strong transfer from synthetic to real imagery across head pose estimation, 3D head alignment, and face detection, supported by extensive ablations. The work also emphasizes privacy safeguards and provides dataset, code, and model releases to accelerate research in 3D head modeling and related tasks.
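
To make the single-pass idea concrete, here is a minimal sketch of how joint box and 3DMM outputs might be decoded once the network has run. The vector layout (`[score, x1, y1, x2, y2, *params]`), the names `HeadPrediction` and `decode_predictions`, and the threshold value are all assumptions made for illustration; they are not taken from the paper or from the YOLO-NAS codebase.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical layout of one per-location prediction vector (an assumption,
# not the paper's actual layout): [score, x1, y1, x2, y2, *morphable_params]
NUM_BOX_VALUES = 4


@dataclass
class HeadPrediction:
    score: float                 # detection confidence
    box: Tuple[float, ...]       # (x1, y1, x2, y2) in pixels
    params: List[float]          # 3DMM coefficients for this head


def decode_predictions(
    raw: Sequence[Sequence[float]],
    score_threshold: float = 0.5,
) -> List[HeadPrediction]:
    """Split each raw vector into box and 3DMM parts, dropping weak detections."""
    heads = []
    for vec in raw:
        score = vec[0]
        if score < score_threshold:
            continue
        box = tuple(vec[1:1 + NUM_BOX_VALUES])
        params = list(vec[1 + NUM_BOX_VALUES:])
        heads.append(HeadPrediction(score=score, box=box, params=params))
    # Highest-confidence heads first.
    heads.sort(key=lambda h: h.score, reverse=True)
    return heads


if __name__ == "__main__":
    raw_output = [
        [0.90, 10, 20, 50, 80, 0.1, -0.2, 0.3],
        [0.20, 0, 0, 5, 5, 0.0, 0.0, 0.0],
    ]
    for head in decode_predictions(raw_output):
        print(head)
```

The point of the sketch is that boxes and mesh parameters come out of the same prediction vector, so every detected head already carries its 3D parameters and no second per-crop model is needed at inference time.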

Abstract

Human head detection, keypoint estimation, and 3D head model fitting are essential tasks with many applications. However, traditional real-world datasets often suffer from bias, privacy, and ethical concerns, and they have been recorded in laboratory environments, which makes it difficult for trained models to generalize. Here, we introduce VGGHeads, a large-scale synthetic dataset generated with diffusion models for human head detection and 3D mesh estimation. Our dataset comprises over 1 million high-resolution images, each annotated with detailed 3D head meshes, facial landmarks, and bounding boxes. Using this dataset, we introduce a new model architecture capable of simultaneous head detection and head mesh reconstruction from a single image in a single step. Through extensive experimental evaluations, we demonstrate that models trained on our synthetic data achieve strong performance on real images. Furthermore, the versatility of our dataset makes it applicable across a broad spectrum of tasks, offering a general and comprehensive representation of human heads.

Paper Structure

This paper contains 24 sections, 3 equations, 14 figures, and 12 tables.

Figures (14)

  • Figure 1: Data Generation. The predicted 2D body pose [yolonas] and scene description [blip] condition the image generation process. A binary detection model predicts head bounding boxes, and a 3DMM regressor [dad3d] generates the final annotation for each head crop.
  • Figure 2: Model Architecture. VGGHeads extends the YOLO-NAS [yolonas] architecture to predict 3D Morphable Model parameters along with head bounding boxes from multi-scale feature maps.
  • Figure 3: Versatility. Our model is able to predict many types of head annotations and works across all datasets.
  • Figure 4: Head Alignment. VGGHeads produces more consistent alignment across various poses by ensuring that the 3D center of the head is reprojected to the center of the aligned image. Methods shown: VGGHeads and RetinaFace [retinaface].
  • Figure 5: Re-ID on Celeb-A. We automatically detect and remove samples in which the face in the generated image matches one of the faces from the Celeb-A dataset [celeba].
  • ...and 9 more figures
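
The Re-ID filtering described in the Figure 5 caption can be sketched as a nearest-match check on face embeddings. This is only an illustration under stated assumptions: the embeddings are taken to be plain float vectors from some face recognition model, similarity is measured with cosine similarity, and the `threshold` value and the function name `filter_matched_samples` are hypothetical. The summary does not say which matcher or threshold the authors used.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_matched_samples(
    generated: Dict[str, Sequence[float]],
    reference: List[Sequence[float]],
    threshold: float = 0.6,  # hypothetical value, not from the paper
) -> List[str]:
    """Return ids of generated samples that match no reference identity."""
    kept = []
    for sample_id, embedding in generated.items():
        # Best similarity against any reference face (0.0 if there are none).
        best = max(
            (cosine_similarity(embedding, ref) for ref in reference),
            default=0.0,
        )
        if best < threshold:
            kept.append(sample_id)
    return kept


if __name__ == "__main__":
    generated = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
    reference = [[1.0, 0.0]]
    print(filter_matched_samples(generated, reference, threshold=0.9))
```

A brute-force comparison like this scales with the product of the two set sizes; at the dataset scale mentioned in the abstract, an approximate nearest-neighbour index would likely be needed in practice.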