VariFace: Fair and Diverse Synthetic Dataset Generation for Face Recognition

Michael Yeung, Toya Teramoto, Songtao Wu, Tatsuo Fujiwara, Kenji Suzuki, Tamaki Kojima

TL;DR

The paper tackles privacy and bias concerns in real face datasets by proposing VariFace, a two-stage diffusion pipeline that generates fair and diverse synthetic faces for face recognition (FR) training. It combines Face Recognition Consistency to refine demographic labels, Face Vendi Score Guidance to boost interclass diversity, and Divergence Score Conditioning to balance identity preservation against intraclass diversity. At a matched dataset size, VariFace comes within 0.0065 of real-data accuracy; in the unconstrained setting it surpasses the real dataset (CASIA-WebFace), reaching a state-of-the-art average verification accuracy of 0.9567 across LFW, CFP-FP, CPLFW, AgeDB, and CALFW and 0.9366 on RFW, while improving minority fairness. This work demonstrates that high-performing, fair FR models can be trained primarily on synthetic data, offering a scalable and privacy-preserving alternative to web-scraped real datasets.

Abstract

The use of large-scale, web-scraped datasets to train face recognition models has raised significant privacy and bias concerns. Synthetic methods mitigate these concerns and provide scalable and controllable face generation to enable fair and accurate face recognition. However, existing synthetic datasets display limited intraclass and interclass diversity and do not match the face recognition performance obtained using real datasets. Here, we propose VariFace, a two-stage diffusion-based pipeline to create fair and diverse synthetic face datasets to train face recognition models. Specifically, we introduce three methods: Face Recognition Consistency to refine demographic labels, Face Vendi Score Guidance to improve interclass diversity, and Divergence Score Conditioning to balance the identity preservation-intraclass diversity trade-off. When constrained to the same dataset size, VariFace considerably outperforms previous synthetic datasets (0.9200 $\rightarrow$ 0.9405) and achieves comparable performance to face recognition models trained with real data (Real Gap = -0.0065). In an unconstrained setting, VariFace not only consistently achieves better performance compared to previous synthetic methods across dataset sizes but also, for the first time, outperforms the real dataset (CASIA-WebFace) across six evaluation datasets. This sets a new state-of-the-art performance with an average face verification accuracy of 0.9567 (Real Gap = +0.0097) across LFW, CFP-FP, CPLFW, AgeDB, and CALFW datasets and 0.9366 (Real Gap = +0.0380) on the RFW dataset.
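
The accuracy figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below assumes that "Real Gap" means synthetic accuracy minus real-data accuracy; the abstract does not define the term, and the helper names are hypothetical. Under that assumption, both the constrained and unconstrained results imply the same real-data baseline on the five-dataset average, which is what one would expect if both are compared against CASIA-WebFace.

```python
def implied_real_accuracy(synthetic_acc: float, real_gap: float) -> float:
    """Recover the real-data accuracy implied by a reported Real Gap.

    Assumption (not stated in the abstract): Real Gap = synthetic - real.
    """
    return round(synthetic_acc - real_gap, 4)


# Values quoted in the abstract.
constrained_real = implied_real_accuracy(0.9405, -0.0065)
unconstrained_real = implied_real_accuracy(0.9567, 0.0097)
rfw_real = implied_real_accuracy(0.9366, 0.0380)

# Improvement over the best previous synthetic dataset at the same size.
gain_over_previous = round(0.9405 - 0.9200, 4)

print(constrained_real, unconstrained_real, rfw_real, gain_over_previous)
```

If the assumed definition is right, the two five-dataset baselines agree, and the RFW baseline is noticeably lower, which matches the larger Real Gap reported there.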

Paper Structure

This paper contains 29 sections, 8 equations, 11 figures, 15 tables, 1 algorithm.

Figures (11)

  • Figure 1: Face verification accuracy using synthetic datasets. Face verification accuracy is the average performance across LFW, CFP-FP, CPLFW, AgeDB, and CALFW datasets. VariFace is trained only with data from CASIA-WebFace (real), the performance of which is shown for reference. All other results are taken from their respective papers.
  • Figure 2: VariFace Training and Inference Pipeline. Training: Predictions for race (R*), gender (G*), and age (A) are extracted using a pretrained CLIP model. Next, a pretrained FR model is used to refine race (R) and gender (G) labels, as well as compute identity (ID) embeddings and divergence scores (DS). These labels are used to train conditional diffusion models to generate interclass and intraclass variation in stages 1 and 2, respectively. Inference: The stage 1 diffusion model generates a balanced dataset of synthetic identities, which are subsequently filtered and processed with a pretrained FR model to generate a set of synthetic ID embeddings. These embeddings, together with randomly sampled A and DS values, are used as conditions for the stage 2 diffusion model to generate a synthetic face dataset, which is passed through the second-stage filter to create the filtered synthetic dataset.
  • Figure 3: Divergence Score Conditioning. By varying the divergence scores applied during sampling, the diversity in generated images can be controlled. From top to bottom, the DS values used are 0.4, 0.6, and 0.8, respectively. All the images are derived from the same synthetic identity.
  • Figure 4: Synthetic dataset characteristics. Top: t-SNE plots of mean face embeddings for identities in different synthetic datasets. The race and gender labels for each embedding are represented by different colors defined in the legend. Bottom: Histogram of divergence scores for different synthetic datasets. The regions where cosine similarity score $<0.3$ and $>0.9$ are shaded in red. CASIA-WebFace (real dataset) is shown for reference.
  • Figure S1: CLIP demographic labeling. Using a pretrained pair of CLIP image and text encoders, the cosine similarities between the image and text embeddings are computed and then converted into softmax probabilities. The final label is obtained after averaging softmax probabilities across values obtained from the image and flipped image embeddings.
  • ...and 6 more figures
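
The labeling procedure described in the Figure S1 caption (cosine similarity between image and text embeddings, softmax, then averaging over the original and flipped image) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings are random stand-ins rather than real CLIP outputs, the function names are hypothetical, and the `scale` factor applied before the softmax is an assumption that the caption does not mention.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_probs(img_emb, flipped_emb, text_embs, scale=100.0):
    """Average class probabilities over the image and its flipped copy.

    `scale` is an assumed temperature-like factor, not taken from the paper.
    """
    per_view = []
    for emb in (img_emb, flipped_emb):
        sims = [cosine(emb, t) for t in text_embs]
        per_view.append(softmax([scale * s for s in sims]))
    return [(p + q) / 2 for p, q in zip(*per_view)]


# Toy data: four hypothetical class prompts, and an image close to class 2.
random.seed(0)
text_embs = [[random.gauss(0, 1) for _ in range(64)] for _ in range(4)]
img_emb = [v + 0.1 * random.gauss(0, 1) for v in text_embs[2]]
flipped_emb = [v + 0.1 * random.gauss(0, 1) for v in text_embs[2]]

probs = label_probs(img_emb, flipped_emb, text_embs)
label = max(range(len(probs)), key=probs.__getitem__)
print(label, [round(p, 3) for p in probs])
```

Averaging probabilities over the two views, rather than averaging the embeddings first, keeps each view's softmax independent, which is how the caption describes the step.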