Improving Generative Adversarial Network Generalization for Facial Expression Synthesis

Arbish Akram, Nazar Khan, Arif Mahmood

Abstract

Facial expression synthesis aims to generate realistic facial expressions while preserving identity. Existing conditional generative adversarial networks (GANs) achieve excellent image-to-image translation results, but their performance often degrades when test images differ from the training dataset. We present Regression GAN (RegGAN), a model that learns an intermediate representation to improve generalization beyond the training distribution. RegGAN consists of two components: a regression layer with local receptive fields that learns expression details by minimizing the reconstruction error through a ridge regression loss, and a refinement network trained adversarially to enhance the realism of generated images. We train RegGAN on the CFEE dataset and evaluate its generalization performance both on CFEE and challenging out-of-distribution images, including celebrity photos, portraits, statues, and avatar renderings. For evaluation, we employ four widely used metrics: Expression Classification Score (ECS) for expression quality, Face Similarity Score (FSS) for identity preservation, QualiCLIP for perceptual realism, and Fréchet Inception Distance (FID) for assessing both expression quality and realism. RegGAN outperforms six state-of-the-art models in ECS, FID, and QualiCLIP, while ranking second in FSS. Human evaluations indicate that RegGAN surpasses the best competing model by 25% in expression quality, 26% in identity preservation, and 30% in realism.
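
To make the idea of a regression layer with local receptive fields more concrete, the following is a minimal sketch and not the authors' implementation. It assumes paired grayscale images, fits one independent ridge regressor per output pixel over a k x k patch of the input, and solves each in closed form. The function names, the patch size `k`, the regularization weight `lam`, and the edge padding are all illustrative assumptions; the adversarially trained refinement network is not covered here.

```python
import numpy as np


def fit_local_ridge(inputs, targets, k=3, lam=1e-2):
    """Fit one ridge regressor per output pixel over a k x k input patch.

    inputs, targets: paired arrays of shape (N, H, W) (assumed grayscale).
    Returns weights of shape (H, W, k*k + 1); the last entry is the bias.
    """
    n, h, w = inputs.shape
    pad = k // 2
    # Replicate border pixels so every output pixel sees a full k x k patch.
    padded = np.pad(inputs, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    weights = np.zeros((h, w, k * k + 1))
    reg = lam * np.eye(k * k + 1)
    reg[-1, -1] = 0.0  # leave the bias term unpenalized
    for i in range(h):
        for j in range(w):
            # Design matrix: flattened local patch plus a constant column.
            patch = padded[:, i:i + k, j:j + k].reshape(n, -1)
            X = np.hstack([patch, np.ones((n, 1))])
            y = targets[:, i, j]
            # Closed-form ridge solution: (X^T X + reg)^{-1} X^T y.
            weights[i, j] = np.linalg.solve(X.T @ X + reg, X.T @ y)
    return weights


def apply_local_ridge(image, weights, k=3):
    """Apply the per-pixel local regressors to a single (H, W) image."""
    h, w = image.shape
    pad = k // 2
    padded = np.pad(image, pad, mode="edge")
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + k, j:j + k].ravel()
            out[i, j] = weights[i, j, :-1] @ patch + weights[i, j, -1]
    return out
```

Because each output pixel depends only on a small neighborhood of the input, the number of parameters per pixel stays at k*k + 1 regardless of image size. This is one plausible reading of why such a layer might overfit less to the training distribution than a fully connected mapping would.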

Paper Structure (25 sections, 14 equations, 11 figures, 3 tables)

Figures (11)

  • Figure 1: Given a neutral input image, the proposed RegGAN synthesizes photorealistic facial expressions on out-of-distribution testing images. Despite being trained only on real human faces from the CFEE dataset [du2014compound], the proposed method introduces realistic expressions, preserves identity, and retains the facial details of the input images.
  • Figure 2: Left: Our method consists of two components: an expression layer and a refinement network. The expression layer takes an input image and generates a new image of the same person with a different facial expression. The refinement network then enhances the quality of the synthesized image, making it more realistic and sharper. Right: Architecture of EAB.
  • Figure 3: Illustration of the input, intermediate, and final outputs synthesized by our proposed RegGAN. The expression layer $G_E$ generates an intermediate image that captures the target expression, while the refinement network $G_R$ transforms this intermediate result into a convincing photorealistic output.
  • Figure 4: Results of RegGAN for facial expression synthesis on out-of-distribution images, including an impasto face (row 1), a celebrity face (row 2), a fantasy image (row 3), a portrait (row 4), and an avatar (row 5). The impasto and fantasy images were generated using Stable Diffusion [stable]. The proposed method introduces convincing expressions while preserving the identity and facial details of the input image.
  • Figure 5: Comparison of the proposed method, RegGAN, with six state-of-the-art facial expression synthesis models (StarGAN, GANimation, MR, DAI2I, SARGAN, and US-GAN) on out-of-distribution facial images. GANimation synthesizes realistic expressions but introduces noticeable artifacts. StarGAN and DAI2I fail to preserve the input image's color distribution in their output. SARGAN and US-GAN preserve facial details but are unable to induce happy expressions. MR generates realistic happy expressions, but the results are often blurry. In contrast, RegGAN produces sharper and more realistic expressions on out-of-distribution images. Moreover, RegGAN demonstrates consistent performance across diverse image types, including human portraits, statues, and avatars collected from the Internet, despite being trained exclusively on the CFEE dataset, which was captured in a controlled environment with consistent lighting and background.
  • ...and 6 more figures