HyperFace: Generating Synthetic Face Recognition Datasets by Exploring Face Embedding Hypersphere
Hatef Otroshi Shahreza, Sébastien Marcel
TL;DR
Web-crawled face datasets raise privacy concerns, which motivates training face recognition models on synthetic data instead. HyperFace formulates dataset generation as a problem of packing reference embeddings on the identity embedding hypersphere, solves it by gradient-based optimization with a manifold-regularization term, and then synthesizes images from the optimized embeddings with a diffusion-based generator. Models trained on HyperFace data achieve state-of-the-art or competitive performance on multiple real benchmarks, and the approach scales to more identities and more images. The work also discusses ethical considerations and potential privacy risks, offering a practical route to scalable, privacy-preserving synthetic face data while acknowledging remaining challenges.
Abstract
Face recognition datasets are often collected by crawling the Internet without individuals' consent, raising ethical and privacy concerns. Generating synthetic datasets for training face recognition models has emerged as a promising alternative. However, generating synthetic datasets remains challenging because it requires adequate inter-class and intra-class variation. While advances in generative models have made it easier to increase intra-class variation in face datasets (such as pose and illumination), generating sufficient inter-class variation is still difficult. In this paper, we formulate dataset generation as a packing problem on the embedding space (represented on a hypersphere) of a face recognition model and propose a new synthetic dataset generation approach, called HyperFace. We formalize our packing problem as an optimization problem and solve it with a gradient descent-based approach. We then use a conditional face generator model to synthesize face images from the optimized embeddings. We use the generated datasets to train face recognition models and evaluate the trained models on several real benchmark datasets. Our experimental results show that models trained with HyperFace achieve state-of-the-art performance among face recognition models trained on synthetic datasets.
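
To illustrate the packing formulation, here is a minimal sketch in Python (NumPy). It is not the authors' implementation: it assumes a simplified objective in which each point is pushed away from its most similar neighbors through a log-sum-exp of cosine similarities, and it keeps points on the unit hypersphere by projecting the gradient onto the tangent space and renormalizing. The function names, the temperature `t`, the step size `lr`, and the toy dimensions are all hypothetical choices. Any regularization toward the face embedding manifold and the image generation step are omitted.

```python
import numpy as np


def random_unit_vectors(n_points, dim, seed=0):
    """Sample points uniformly on the unit hypersphere (stand-ins for identity embeddings)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_points, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def min_pairwise_angle_deg(x):
    """Smallest angle, in degrees, between any two distinct points."""
    sim = np.clip(x @ x.T, -1.0, 1.0)
    np.fill_diagonal(sim, -1.0)  # ignore self-similarity
    return float(np.degrees(np.arccos(sim.max())))


def pack_on_hypersphere(x0, steps=500, lr=0.05, t=0.1):
    """Spread points on the sphere by gradient descent on a smooth repulsion energy.

    Assumed energy (illustrative only):
        E = sum_i t * log(sum_{j != i} exp(<x_i, x_j> / t))
    which is a smooth surrogate for each point's largest cosine similarity.
    """
    x = x0.copy()
    for _ in range(steps):
        sim = x @ x.T
        np.fill_diagonal(sim, -np.inf)               # exclude self-pairs
        logits = sim / t
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        w = np.exp(logits)
        w /= w.sum(axis=1, keepdims=True)            # row-wise softmax weights
        grad = (w + w.T) @ x                         # Euclidean gradient of E
        # Remove the radial component so the step stays tangent to the sphere.
        grad -= np.sum(grad * x, axis=1, keepdims=True) * x
        x -= lr * grad
        x /= np.linalg.norm(x, axis=1, keepdims=True)  # map back onto the sphere
    return x


x0 = random_unit_vectors(n_points=16, dim=8)
x = pack_on_hypersphere(x0)
print("min angle before:", min_pairwise_angle_deg(x0))
print("min angle after: ", min_pairwise_angle_deg(x))
```

A larger minimum pairwise angle after optimization means the points are better separated. In the setting of the paper, better separation corresponds to greater inter-class variation among the synthetic identities.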
