If It's Not Enough, Make It So: Reducing Authentic Data Demand in Face Recognition through Synthetic Faces

Andrea Atzori, Fadi Boutros, Naser Damer, Gianni Fenu, Mirko Marras

TL;DR

This work investigates how synthetic faces can reduce the demand for authentic data in face recognition (FR). It systematically compares authentic-data baselines with synthetic-data baselines, then studies mixed-domain training under two settings, a fixed total number of identities and a fixed number of synthetic identities, each coupled with data augmentation. Key findings show that synthetic data alone underperforms authentic data, but combining synthetic data with a small number of authentic identities yields substantial gains, sometimes surpassing authentic-data baselines, with diffusion-based synthetic data often providing stronger improvements. RandAugment on synthetic data emerges as a promising augmentation strategy in mixed-domain settings, though effects vary by dataset and configuration. Overall, the results support a privacy-aware approach that pairs synthetic data with limited authentic data to achieve high FR performance, with several avenues for future optimization and fairness considerations.

Abstract

Recent advances in deep face recognition have spurred a growing demand for large, diverse, and manually annotated face datasets. Acquiring authentic, high-quality data for face recognition has proven to be a challenge, primarily due to privacy concerns. Large face datasets are primarily sourced from web-based images, lacking explicit user consent. In this paper, we examine whether and how synthetic face data can be used to train effective face recognition models with reduced reliance on authentic images, thereby mitigating data collection concerns. First, we explored the performance gap among recent state-of-the-art face recognition models, trained with synthetic data only and authentic (scarce) data only. Then, we deepened our analysis by training a state-of-the-art backbone with various combinations of synthetic and authentic data, gaining insights into optimizing the limited use of the latter for verification accuracy. Finally, we assessed the effectiveness of data augmentation approaches on synthetic and authentic data, with the same goal in mind. Our results highlighted the effectiveness of FR trained on combined datasets, particularly when combined with appropriate augmentation techniques.

Paper Structure (16 sections, 1 equation, 3 figures)

Figures (3)

  • Figure 1: Average verification accuracy on the testing dataset (y-axis) vs. number of authentic identities from CASIA-WebFace (WF) in the training dataset (x-axis). In all settings, verification accuracy improved as the number of authentic identities in the training dataset increased. Also, combining synthetic data (black line: DCFace; green line: ExFaceGAN (GANControl)) with a limited subset of authentic data improved FR performance compared with training on only that limited subset of authentic identities (red line). The number of synthetic identities in the combined-dataset experiments is fixed (10K).
  • Figure 2: FR training paradigm overview. Subsets of authentic and synthetic data are combined to form the training dataset. During the training phase, only the synthetic data is augmented with RandAugment, as discussed in the results section. The network architecture in all settings is ResNet50, trained with the CosFace loss.
  • Figure 3: Average verification accuracy on testing data (y-axis) vs. number of authentic identities in training data (x-axis). The top row refers to authentic data sampled from WF; the second row refers to authentic data sampled from M2-S. Verification accuracy improved as the number of authentic identities in the training dataset increased. Also, combining synthetic data with a limited subset of authentic data (black and green lines) improved FR performance compared with training on only that limited subset of authentic identities (red lines). The number of synthetic identities in the combined-dataset experiments is fixed (10K). The results in these plots correspond to those reported in the RQ1 results tables for WF and M2-S.
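
The training paradigm in the Figure 2 caption (identity subsets from both domains merged into one training set, with augmentation applied to synthetic samples only) can be illustrated with a minimal sketch. This is not the authors' code: the names `Sample`, `build_mixed_dataset`, and `load_for_training` are hypothetical, images are stood in for by file-name strings, and the augmentation is passed in as an arbitrary callable. In a real pipeline that callable might be something like torchvision's `RandAugment` transform, which is an assumption about tooling rather than something the source states.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Sample:
    image: str       # placeholder for image data or a file path
    identity: str    # class label, kept unique across both domains
    synthetic: bool  # True for generated faces, False for authentic ones


def build_mixed_dataset(authentic: Dict[str, List[str]],
                        synthetic: Dict[str, List[str]],
                        n_authentic_ids: int,
                        n_synthetic_ids: int,
                        seed: int = 0) -> List[Sample]:
    """Sample identity subsets from each domain and merge them into one list."""
    rng = random.Random(seed)
    auth_ids = rng.sample(sorted(authentic), n_authentic_ids)
    synth_ids = rng.sample(sorted(synthetic), n_synthetic_ids)
    samples: List[Sample] = []
    for ident in auth_ids:
        # Prefix labels so authentic and synthetic identities never collide.
        samples += [Sample(img, f"auth/{ident}", False) for img in authentic[ident]]
    for ident in synth_ids:
        samples += [Sample(img, f"synth/{ident}", True) for img in synthetic[ident]]
    return samples


def load_for_training(sample: Sample, augment: Callable[[str], str]) -> str:
    """Augment synthetic samples only; authentic samples pass through unchanged."""
    return augment(sample.image) if sample.synthetic else sample.image


# Toy data: 5 authentic and 10 synthetic identities, 2 images each.
toy_authentic = {f"a{i}": [f"a{i}_{k}.jpg" for k in range(2)] for i in range(5)}
toy_synthetic = {f"s{i}": [f"s{i}_{k}.jpg" for k in range(2)] for i in range(10)}
toy_data = build_mixed_dataset(toy_authentic, toy_synthetic,
                               n_authentic_ids=3, n_synthetic_ids=10)
```

Keeping the synthetic identity count fixed while varying `n_authentic_ids` mirrors the setting plotted in Figures 1 and 3. The fixed-total-identities setting mentioned in the TL;DR would instead vary both counts so that their sum stays constant.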