SynthForge: Synthesizing High-Quality Face Dataset with Controllable 3D Generative Models

Abhay Rawat, Shubham Dokania, Astitva Srivastava, Shuaib Ahmed, Haiwen Feng, Rahul Tallamraju

TL;DR

SynthForge introduces a 3D-aware synthetic data pipeline that leverages Next3D and FLAME-based 3DMMs to generate high-fidelity face images with dense multi-task annotations. A Multi-Task Synthetic Backbone is trained on this synthetic data to predict segmentation, depth, and keypoints, followed by Stage II label finetuning on real data to bridge domain gaps. The approach yields competitive results on benchmarks like 300W, LaPa, and CelebAMask-HQ, while significantly reducing data collection costs (e.g., 100k images in ~7 hours on a single V100). By releasing the curated synthetic dataset, pretrained models, and code, SynthForge provides a scalable path for realistic, controllable synthetic data in facial analysis tasks.

Abstract

Recent advances in generative models have made it possible to render photo-realistic data in a controllable fashion. Trained on real data, these generative models can produce realistic samples with minimal to no domain gap compared to traditional graphics rendering. However, using data generated by such models to train downstream tasks remains under-explored, mainly due to the lack of 3D-consistent annotations. Moreover, controllable generative models are learned from massive data, and their latent space is often too vast to obtain meaningful sample distributions for downstream tasks with limited generation. To overcome these challenges, we extract 3D-consistent annotations from an existing controllable generative model, making the data useful for downstream tasks. Our experiments show competitive performance against state-of-the-art models using only generated synthetic data, demonstrating potential for solving downstream tasks. Project page: https://synth-forge.github.io

Paper Structure (17 sections, 2 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Our 3D-aware synthetic data pipeline produces high-fidelity images with dense multi-task spatial annotations. Compared to physically based rendering methods, we provide improvements in terms of compute, cost, and time efficiency. The generated dataset is further used for training facial analysis models for the tasks of semantic segmentation, depth estimation, and keypoint estimation.
  • Figure 2: Overview of the framework for facial analysis using synthetic data. (A) A synthetic dataset is generated with a 3D-aware generative model, with annotations derived from FLAME parameters. (B) In Stage I, this dataset is used to train a multi-task prediction module ($\mathcal{M}$) with a self-supervision loss. (C) In Stage II, the pre-trained module is used to fine-tune predictions on real-world data.
  • Figure 3: UV texture map for the FLAME head model showing the semantic labels used to generate the segmentation annotations in our proposed data generation pipeline. These segmentation masks can be updated through the UV map to adapt to finer or coarser levels of semantic information.
  • Figure 4: Qualitative results on the face alignment, face parsing, and depth estimation tasks. From left to right: a) ground truth (GT) landmarks (blue) with predicted landmarks from the SB model, b) GT landmarks (blue) with predicted SB + LF w/ $f_s$ landmarks, c) GT face parsing, d) predicted face parsing with SB, e) predicted face parsing with SB + LF w/ $f_s$, f) GT depth map, g) predicted depth map with SB, h) predicted depth map with SB + LF w/ $f_s$.
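
The caption of Figure 3 describes segmentation annotations that come from semantic labels painted on the FLAME UV map. A minimal sketch of that idea follows, assuming each image pixel already carries the UV coordinate of the mesh surface visible there (or nothing, for background). The function names, label ids, nearest-neighbour lookup, and the convention that v grows downward are all illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List, Optional, Tuple

# (u, v) in [0, 1] for pixels covered by the mesh, None for background.
UV = Optional[Tuple[float, float]]

BACKGROUND = 0  # hypothetical label id for pixels not covered by the mesh


def segmentation_from_uv(
    uv_image: List[List[UV]],
    uv_labels: List[List[int]],
) -> List[List[int]]:
    """Look up a semantic label for every pixel in a UV label texture.

    Uses nearest-neighbour sampling; assumes v = 0 is the top texture row.
    """
    tex_h = len(uv_labels)
    tex_w = len(uv_labels[0])
    mask: List[List[int]] = []
    for row in uv_image:
        out_row: List[int] = []
        for uv in row:
            if uv is None:
                out_row.append(BACKGROUND)
                continue
            u, v = uv
            # Clamp so that u == 1.0 or v == 1.0 maps to the last texel.
            x = min(int(u * tex_w), tex_w - 1)
            y = min(int(v * tex_h), tex_h - 1)
            out_row.append(uv_labels[y][x])
        mask.append(out_row)
    return mask


def merge_labels(mask: List[List[int]], mapping: Dict[int, int]) -> List[List[int]]:
    """Coarsen a mask by remapping label ids; unmapped ids are kept."""
    return [[mapping.get(label, label) for label in row] for row in mask]


# Toy 2x2 UV label texture with hypothetical ids: 1 = skin, 2 = nose, 3 = lips.
uv_labels = [[1, 2],
             [3, 3]]

# Toy 2x2 image: one background pixel and three pixels on the face surface.
uv_image = [[None, (0.1, 0.1)],
            [(0.9, 0.1), (0.5, 0.9)]]

fine_mask = segmentation_from_uv(uv_image, uv_labels)
coarse_mask = merge_labels(fine_mask, {2: 1})  # fold "nose" into "skin"
```

Because the labels live in the UV texture rather than in the images, changing the label granularity only needs a new texture or a remapping such as `merge_labels`, with no re-rendering of the face images.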