EFHQ: Multi-purpose ExtremePose-Face-HQ dataset

Trung Tuan Dao, Duc Hong Vu, Cuong Pham, Anh Tran

TL;DR

EFHQ introduces a large-scale, high-quality extreme-pose facial dataset (~450k images) harvested from VFHQ and CelebV-HQ via a robust, ensemble pose-labeling and manual-review pipeline. It demonstrates multi-task utility by enhancing 2D/3D face generation (StyleGAN2-ADA, EG3D) and diffusion-based methods (ControlNet), improves face reenactment with augmented training data, and exposes pose-induced weaknesses in face recognition through a cross-view verification benchmark. The work provides targeted auxiliary datasets, detailed processing hyperparameters, and a rigorous evaluation framework, including human surveys, to validate EFHQ’s benefits and encourage broader adoption. Overall, EFHQ fills a critical pose-diversity gap and enables more reliable cross-pose synthesis, reenactment, and verification in real-world settings.

Abstract

Existing facial datasets contain plentiful images at near-frontal views but lack images with extreme head poses, which degrades the performance of deep learning models on profile or pitched faces. This work addresses this gap by introducing a novel dataset named Extreme Pose Face High-Quality Dataset (EFHQ), which includes up to 450k high-quality images of faces at extreme poses. To produce such a massive dataset, we utilize a novel and meticulous dataset processing pipeline to curate two publicly available datasets, VFHQ and CelebV-HQ, which contain many high-resolution face videos captured in various settings. Our dataset can complement existing datasets on various face-related tasks, such as facial synthesis with 2D/3D-aware GANs, diffusion-based text-to-image face generation, and face reenactment. Specifically, training with EFHQ helps models generalize well across diverse poses, significantly improving performance in scenarios involving extreme views, as confirmed by extensive experiments. Additionally, we use EFHQ to define a challenging cross-view face verification benchmark, in which the performance of SOTA face recognition models drops 5-37% compared to frontal-to-frontal scenarios, aiming to stimulate studies on face recognition under severe pose conditions in the wild.

Paper Structure (35 sections, 30 figures, 8 tables)

Figures (30)

  • Figure 1: Benefits of our proposed dataset (EFHQ). Standard large-scale facial datasets have most images at near-frontal views, causing inferior performance of trained models on downstream tasks when dealing with extreme head poses. For instance, trained 2D image generators and text-to-image generators often produce only near-frontal faces, while 3D face generators and face reenactment methods often show distorted outputs at profile views. The recently proposed dataset LPFF partially handles that issue by providing complementary images at extreme head poses, but only for 2D and 3D image generation tasks. Our proposed dataset EFHQ provides high-quality extreme-pose images to complement a wide range of face-related tasks. It supports 2D and 3D image generation, with generally better diversity than LPFF. EFHQ also helps correct the outputs of text-to-image generation and face reenactment at extreme views. Finally, EFHQ provides a more challenging pose-based face verification benchmark to better assess the quality of face recognition networks.
  • Figure 2: The pipeline of EFHQ dataset creation. Starting with high-quality videos from the VFHQ and CelebV-HQ datasets, single-frame attributes are extracted and then manually reviewed. Task-specific preprocessing is then applied to generate specialized versions of the dataset for tasks such as face generation, reenactment, and verification.
  • Figure 3: Example cases where a pose estimator fails to categorize the sample to the correct bin.
  • Figure 4: Pose distribution comparison between our sampled dataset and other datasets, including FFHQ, LPFF, CPLFW, and 40K random samples from VoxCeleb1. Our sampled dataset demonstrates greater pose diversity, with increased sample counts across high-angle bins.
  • Figure 5: Qualitative results and comparison of samples generated with truncation $\psi=0.7$ by StyleGAN2-ADA models trained on other datasets and on ours.
  • ...and 25 more figures
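
The caption of Figure 5 mentions sampling with truncation $\psi=0.7$. As a reminder of what that parameter does, here is a minimal sketch of the standard StyleGAN truncation trick, which interpolates each intermediate latent toward the average latent: $w' = \bar{w} + \psi\,(w - \bar{w})$. This is a generic NumPy illustration, not the authors' code; the function name `truncate_latents`, the 512-dimensional latent size, and the zero average latent are assumptions chosen for the example.

```python
import numpy as np


def truncate_latents(w, w_avg, psi=0.7):
    """Apply the truncation trick: w' = w_avg + psi * (w - w_avg).

    psi = 1.0 leaves the latents unchanged; psi = 0.0 collapses every
    latent onto the average. Smaller psi trades diversity for fidelity.
    """
    w = np.asarray(w, dtype=np.float64)
    w_avg = np.asarray(w_avg, dtype=np.float64)
    return w_avg + psi * (w - w_avg)


# Hypothetical stand-ins: in a real generator, w comes from the mapping
# network and w_avg is a running average tracked during training.
rng = np.random.default_rng(0)
w_avg = np.zeros(512)
w = rng.standard_normal((4, 512))

w_trunc = truncate_latents(w, w_avg, psi=0.7)
print(w_trunc.shape)
```

With $\psi=0.7$, each sample's offset from the average latent shrinks to 70% of its original length. Truncation itself tends to suppress rare modes, which should be kept in mind when comparing pose diversity across generators at a fixed $\psi$.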