FacEnhance: Facial Expression Enhancing with Recurrent DDPMs

Hamza Bouzid; Lahoucine Ballihi

TL;DR

FacEnhance tackles the challenge of generating high-fidelity facial expression videos from low-resolution inputs by coupling lightweight 64×64 expression generation with a diffusion-based enhancer that outputs 192×192 frames. The method uses conditional denoising within a DDPM, guided by a low-resolution expression frame $f_{low}^n$, a neutral high-resolution identity image $I_{Id}$, and the previously generated high-resolution frame $\bar{f}_{high}^{n-1}$, with an expression encoder to inject expression cues. Experiments on the MUG dataset, including ablations and comparisons against state-of-the-art baselines, show that FacEnhance improves FVD, PSNR, and SSIM while preserving identity, supporting the approach as a resource-efficient path to high-fidelity facial expression video generation. The work is relevant to applications that require high-resolution, temporally coherent facial videos with realistic backgrounds. The authors acknowledge computational demands and occasional failures, which motivate future efficiency improvements and higher-resolution extensions.

Abstract

Facial expressions, vital in non-verbal human communication, have found applications in various computer vision fields such as virtual reality, gaming, and emotional AI assistants. Despite advancements, many facial expression generation models encounter challenges such as low resolution (e.g., 32x32 or 64x64 pixels), poor quality, and the absence of background details. In this paper, we introduce FacEnhance, a novel diffusion-based approach addressing constraints in existing low-resolution facial expression generation models. FacEnhance enhances low-resolution facial expression videos (64x64 pixels) to higher resolutions (192x192 pixels), incorporating background details and improving overall quality. Leveraging conditional denoising within a diffusion framework, guided by a background-free low-resolution video and a single neutral expression high-resolution image, FacEnhance generates a video in which the individual performs the facial expression from the low-resolution video, with the background taken from the neutral image. By complementing lightweight low-resolution models, FacEnhance strikes a balance between computational efficiency and desirable image resolution and quality. Extensive experiments on the MUG facial expression database demonstrate the efficacy of FacEnhance in enhancing low-resolution model outputs to state-of-the-art quality while preserving content and identity consistency. FacEnhance represents significant progress towards resource-efficient, high-fidelity facial expression generation, renewing outdated low-resolution methods to up-to-date standards.
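The recurrent conditioning described above can be illustrated with a toy sampling loop. This is a minimal sketch under stated assumptions, not the authors' implementation: frames are flat lists of floats rather than images, `toy_eps_model` is a hypothetical stand-in for the trained conditional noise predictor, the linear beta schedule and step count are assumed, the expression encoder is not modelled, and conditioning the first frame on the identity image is an assumption made only so the loop has a starting point.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear beta schedule plus cumulative products of (1 - beta)."""
    betas = [beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
             for t in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def upsample(values, factor):
    """Toy nearest-neighbour upsampling: repeat each value `factor` times."""
    return [v for v in values for _ in range(factor)]


def toy_eps_model(x_t, t, cond_low, identity, prev_high, alpha_bars):
    """Hypothetical stand-in for the trained conditional noise predictor.

    It guesses the clean frame as the mean of the three conditioning
    signals and returns the noise that this guess implies for x_t.
    """
    a_bar = alpha_bars[t]
    x0_guess = [(l + i + p) / 3.0
                for l, i, p in zip(cond_low, identity, prev_high)]
    return [(x - math.sqrt(a_bar) * x0) / math.sqrt(1.0 - a_bar)
            for x, x0 in zip(x_t, x0_guess)]


def enhance_frame(f_low, identity, prev_high, betas, alpha_bars, rng):
    """Run DDPM ancestral sampling for one high-resolution frame."""
    cond_low = upsample(f_low, len(identity) // len(f_low))
    x = [rng.gauss(0.0, 1.0) for _ in identity]  # start from Gaussian noise
    for t in reversed(range(len(betas))):
        eps = toy_eps_model(x, t, cond_low, identity, prev_high, alpha_bars)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        scale = 1.0 / math.sqrt(1.0 - betas[t])
        noise_std = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise at t = 0
        x = [scale * (xi - coef * ei) + noise_std * rng.gauss(0.0, 1.0)
             for xi, ei in zip(x, eps)]
    return x


def enhance_video(low_frames, identity, num_steps=50, seed=0):
    """Enhance frames one by one, feeding each output into the next step."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    prev_high = identity  # assumption: the first frame is conditioned on I_Id
    video = []
    for f_low in low_frames:
        prev_high = enhance_frame(f_low, identity, prev_high,
                                  betas, alpha_bars, rng)
        video.append(prev_high)
    return video


if __name__ == "__main__":
    low = [[0.0, 1.0], [1.0, 0.0]]   # two toy "low-resolution" frames
    neutral = [0.5] * 6              # toy "high-resolution" identity image
    for frame in enhance_video(low, neutral, num_steps=20, seed=1):
        print([round(v, 3) for v in frame])
```

The point of the sketch is the data flow: each frame is denoised from fresh Gaussian noise, and the only link between consecutive frames is the previously generated output passed in as a condition.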

Paper Structure (21 sections, 8 equations, 8 figures, 3 tables, 3 algorithms)

Introduction
Related Work
PRELIMINARIES: DIFFUSION MODELS
Proposed Approach
Our proposed Model: FacEnhance
Inputs
The used architecture
Training
Inference
Video Inference
EXPERIMENT
DATASET
IMPLEMENTATION DETAILS
EVALUATION METRICS
EXPERIMENTAL RESULTS
...and 6 more sections

Figures (8)

Figure 1: A graphical representation of diffusion models, highlighting the noise diffusion process $q(x_t|x_{t - 1})$ and the denoising process $p_{\theta}(x_{t - 1}|x_t)$
Figure 2: Overview of the proposed facial expression enhancement model. The diffusion model refines Gaussian noise iteratively, guided by a low-resolution frame $f_{l}^{n}$ from input video $v_{l}$, a neutral high-resolution image $I_{Id}$, and the previously generated frame $\bar{f}_{h}^{n-1}$, resulting in an improved higher-resolution frame $\bar{f}_{high}^{n}$.
Figure 3: Video examples generated by our model showcasing the six basic facial expressions. The right side features generated videos of the six facial expressions performed by the same individual, while the left side presents six different individuals, with each person expressing one specific emotion.
Figure 4: Qualitative comparison of facial expression sequences before and after enhancement using our model. On the far left of the figure, we display the input images. Adjacent to them are videos generated by VideoGAN (a), ImaGINator (b), MotionGAN (c), and FEV-GAN (d). On the right side, we present the corresponding enhanced videos by FacEnhance.
Figure 5: Qualitative comparison of sequences generated by the FacEnhance model and state-of-the-art models on the MUG database. The sequences of our model (a, c, e, g), ImaGINator (b), and LFDM (f) are randomly selected from the test results, and the sequences of VDM (c) and FDM (e) are taken from ni2023conditional.
...and 3 more figures
