Analyzing the Feature Extractor Networks for Face Image Synthesis

Erdi Sarıtaş; Hazım Kemal Ekenel

TL;DR

This work tackles the open problem of how to reliably evaluate realism in face image synthesis by comparing four diverse feature extractors (InceptionV3, CLIP, DINOv2, and ArcFace) across three metrics: FID, KID, and Precision-Recall. It uses FFHQ as the target domain and CelebA-HQ plus synthetic data from StyleGAN2 and Projected FastGAN as sources, and it additionally explores $L_2$ feature normalization, attention patterns via Grad-CAM, and feature-space structure via PaCMAP embeddings. Findings indicate that InceptionV3 and DINOv2 can misalign with perceptual realism, CLIP remains relatively stable, and ArcFace shows strong numeric and embedding-space performance, albeit with perplexing attention maps. The results provide practical guidance for selecting feature extractors and normalization strategies in face-synthesis evaluation and point toward diffusion-based methods as a future direction to improve realism assessment.

Abstract

Advances such as Generative Adversarial Networks have drawn researchers' attention to face image synthesis, with the goal of generating ever more realistic images. As a result, the need for evaluation criteria that assess the realism of generated images has become apparent. While FID computed with InceptionV3 features is one of the primary choices for benchmarking, concerns about InceptionV3's limitations for face images have emerged. This study investigates the behavior of diverse feature extractors (InceptionV3, CLIP, DINOv2, and ArcFace) under a variety of metrics (FID, KID, and Precision & Recall). The FFHQ dataset serves as the target domain; the source domains are the CelebA-HQ dataset and synthetic datasets generated with StyleGAN2 and Projected FastGAN. The experiments include an in-depth analysis of the features: $L_2$ normalization, model attention during extraction, and domain distributions in the feature space. We aim to provide insights into the behavior of feature extractors for evaluating face image synthesis methods. The code is publicly available at https://github.com/ThEnded32/AnalyzingFeatureExtractors.
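
To make the role of $L_2$ normalization concrete, the following is a minimal sketch of the standard Fréchet distance between two sets of extracted features, computed with and without row-wise $L_2$ normalization. It is not taken from the authors' repository: the function names `l2_normalize` and `frechet_distance`, the 8-dimensional random features standing in for extractor outputs, and the eigenvalue-based trace of the matrix square root are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(feats, eps=1e-12):
    """Scale each feature vector (row) to unit Euclidean length."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, eps)


def frechet_distance(source_feats, target_feats):
    """Frechet distance between Gaussians fitted to two feature sets.

    Both inputs have shape (num_samples, feature_dim).
    """
    mu_s = source_feats.mean(axis=0)
    mu_t = target_feats.mean(axis=0)
    sigma_s = np.cov(source_feats, rowvar=False)
    sigma_t = np.cov(target_feats, rowvar=False)

    # tr((Sigma_s Sigma_t)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of Sigma_s Sigma_t, which are real and non-negative in
    # exact arithmetic; clip small negative values caused by round-off.
    eigvals = np.linalg.eigvals(sigma_s @ sigma_t)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_s - mu_t
    return float(
        diff @ diff + np.trace(sigma_s) + np.trace(sigma_t) - 2.0 * trace_sqrt
    )


# Hypothetical stand-ins for extractor outputs (e.g. target vs. generated).
rng = np.random.default_rng(0)
target = rng.normal(size=(500, 8))
source = rng.normal(loc=0.5, size=(500, 8))

fid_raw = frechet_distance(source, target)
fid_l2 = frechet_distance(l2_normalize(source), l2_normalize(target))
```

Normalizing each vector to unit length removes differences in feature magnitude, so `fid_l2` reflects only directional differences between the two distributions. The two values are therefore on different scales and should not be compared with each other directly.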

Paper Structure

This paper contains 12 sections, 5 equations, 3 figures, and 4 tables.

INTRODUCTION
RELATED WORKS
Evaluation Metrics
Feature Extractors
EXPERIMENTAL SETUP
RESULTS
Comparison of StyleGAN2 & ProjectedGAN
Heat Map Analysis
Analysis with CelebA-HQ
PaCMAP Analysis
CONCLUSIONS
ACKNOWLEDGMENTS

Figures (3)

Figure 1: Sample images from the FFHQ, CelebA-HQ, and synthetic StyleGAN2- and ProjectedGAN-generated datasets.
Figure 2: Averaged heat maps computed from images generated by StyleGAN2. The heat map resolutions are 8x8, 7x7, 16x16, and 7x7 for InceptionV3, CLIP, DINOv2, and ArcFace, respectively. For each model, the left image shows the original heat map and the right image shows it overlaid on a sample image. The color map is given below; attention increases from blue to yellow.
Figure 3: PaCMAP visualization results. In each sub-figure, blue dots represent the target, and red dots represent the source data points.