Unveiling Synthetic Faces: How Synthetic Datasets Can Expose Real Identities

Hatef Otroshi Shahreza, Sébastien Marcel

TL;DR

This work designs a simple yet effective membership inference attack to study systematically whether existing synthetic face recognition datasets leak information from the real data used to train their generator models, and shows that such leakage does occur.

Abstract

Synthetic data generation is gaining popularity in a range of computer vision applications. Existing state-of-the-art face recognition models are trained on large-scale face datasets that are crawled from the Internet and raise privacy and ethical concerns. To address these concerns, several works have proposed generating synthetic face datasets to train face recognition models. However, these methods depend on generative models that are themselves trained on real face images. In this work, we design a simple yet effective membership inference attack to study systematically whether existing synthetic face recognition datasets leak information from the real data used to train the generator model. We provide an extensive study on six state-of-the-art synthetic face recognition datasets and show that, in all of them, several samples from the original real dataset are leaked. To our knowledge, this paper is the first to show leakage from the training data of generator models into the generated synthetic face recognition datasets. Our study demonstrates privacy pitfalls in synthetic face recognition datasets and paves the way for future studies on generating responsible synthetic face datasets.

Paper Structure

This paper contains 7 sections, 14 figures, 1 table, 1 algorithm.

Figures (14)

  • Figure 1: Sample face images leaked from training data (first row) of generative models in different state-of-the-art synthetic face recognition datasets (second row).
  • Figure 2: Schematic diagram of data leakage from generator's training data into generated synthetic face recognition dataset.
  • Figure 3: Histograms of cosine similarity scores for all retrieved image pairs in each synthetic dataset, with the corresponding similarity values for the top-k pairs (k=1500) shown as dashed vertical lines. The first plot shows the histogram of similarity scores for positive and negative pairs in the IJB-C dataset (benchmark), with the threshold for FAR=0.01% on IJB-C shown as a dotted vertical line.
  • Figure 4: Sample face images leaked from training data (first row) of the generative model in the DCFace dataset (second row). For more samples, see Fig. \ref{fig:sample_dcface_more}.
  • Figure 5: Sample face images leaked from training data (first row) of the generative model in the IDiff-Face (Uniform) dataset (second row).
  • ...and 9 more figures
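
The caption of Figure 3 mentions two quantities: cosine similarity scores for retrieved pairs of synthetic and real images, and a decision threshold at a fixed false acceptance rate (FAR). The following is a minimal sketch of how such quantities could be computed. It is not the authors' implementation. It assumes face embeddings have already been extracted by some face recognition model (this summary does not name one), and the function names `top_k_pairs` and `threshold_at_far` are hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def top_k_pairs(synthetic_emb: np.ndarray, real_emb: np.ndarray, k: int):
    """Return the k (synthetic_idx, real_idx, score) triples with the highest
    cosine similarity, sorted from most to least similar.

    Assumption: both inputs are (n, d) arrays of precomputed embeddings.
    """
    sims = l2_normalize(synthetic_emb) @ l2_normalize(real_emb).T
    flat_sims = sims.ravel()
    k = min(k, flat_sims.size)
    # Select the k largest scores without sorting the whole matrix ...
    flat_idx = np.argpartition(flat_sims, -k)[-k:]
    # ... then sort only those k in descending order.
    flat_idx = flat_idx[np.argsort(flat_sims[flat_idx])[::-1]]
    syn_idx, real_idx = np.unravel_index(flat_idx, sims.shape)
    return [(int(s), int(r), float(sims[s, r])) for s, r in zip(syn_idx, real_idx)]


def threshold_at_far(negative_scores: np.ndarray, far: float) -> float:
    """Score above which roughly a fraction `far` of negative
    (different-identity) pairs would be falsely accepted."""
    return float(np.quantile(negative_scores, 1.0 - far))


if __name__ == "__main__":
    # Random vectors stand in for real embeddings (illustration only).
    rng = np.random.default_rng(0)
    real = rng.normal(size=(50, 16))
    synthetic = rng.normal(size=(20, 16))
    synthetic[3] = 2.0 * real[7]  # planted "leaked" sample
    for s, r, score in top_k_pairs(synthetic, real, k=3):
        print(f"synthetic {s} <-> real {r}: cosine similarity {score:.3f}")
```

This sketch builds the full similarity matrix in memory, which is only practical for small inputs. For datasets of realistic size the comparison would have to be done in chunks or with an approximate nearest-neighbour index.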