FacEnhance: Facial Expression Enhancing with Recurrent DDPMs
Hamza Bouzid, Lahoucine Ballihi
TL;DR
FacEnhance tackles the challenge of generating high-fidelity facial expression videos from low-resolution inputs by coupling lightweight 64×64 expression generation with a diffusion-based enhancer that outputs 192×192 frames. The method uses conditional denoising within a DDPM, guided by a low-resolution expression frame $f_{low}^n$, a neutral high-resolution identity image $I_{Id}$, and the previously generated high-resolution frame $\bar{f}_{high}^{n-1}$, with an expression encoder to inject expression cues. Experiments on the MUG dataset, including ablations and comparisons against state-of-the-art baselines, show that FacEnhance improves FVD, PSNR, and SSIM while preserving identity, supporting the approach as a resource-efficient path to high-fidelity facial expression video generation. The work is relevant to applications that require high-resolution, temporally coherent facial videos with realistic backgrounds. It also acknowledges computational demands and occasional failures, which motivate future work on efficiency and higher-resolution extensions.
Abstract
Facial expressions, vital in non-verbal human communication, have found applications in various computer vision fields such as virtual reality, gaming, and emotional AI assistants. Despite advancements, many facial expression generation models encounter challenges such as low resolution (e.g., 32×32 or 64×64 pixels), poor quality, and the absence of background details. In this paper, we introduce FacEnhance, a novel diffusion-based approach addressing constraints in existing low-resolution facial expression generation models. FacEnhance enhances low-resolution facial expression videos (64×64 pixels) to higher resolutions (192×192 pixels), incorporating background details and improving overall quality. Leveraging conditional denoising within a diffusion framework, guided by a background-free low-resolution video and a single neutral-expression high-resolution image, FacEnhance generates a video in which the individual from the neutral image performs the facial expression from the low-resolution video, with the background taken from the neutral image. By complementing lightweight low-resolution models, FacEnhance strikes a balance between computational efficiency and desirable image resolution and quality. Extensive experiments on the MUG facial expression database demonstrate the efficacy of FacEnhance in enhancing low-resolution model outputs to state-of-the-art quality while preserving content and identity consistency. FacEnhance represents significant progress towards resource-efficient, high-fidelity facial expression generation, bringing outdated low-resolution methods up to current standards.
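
To make the recurrent conditioning concrete, the following is a minimal sketch, not the authors' implementation. It assumes the three guidance signals (the upsampled low-resolution frame, the neutral identity image, and the previously generated high-resolution frame) are concatenated along the channel axis, that the first frame uses the identity image in place of a previous frame, and that a standard DDPM ancestral sampling step with a linear noise schedule is used. The names `upsample_nearest`, `build_condition`, `toy_noise_predictor`, and `enhance_video` are hypothetical, and the noise predictor is a closed-form stand-in for the trained network. The expression encoder is omitted.

```python
import numpy as np

LOW_RES, HIGH_RES = 64, 192      # resolutions stated in the abstract
SCALE = HIGH_RES // LOW_RES      # 3x enlargement per side


def upsample_nearest(frame, scale):
    """Nearest-neighbour upsampling of an (H, W, C) frame (an assumed choice)."""
    return frame.repeat(scale, axis=0).repeat(scale, axis=1)


def build_condition(f_low, identity, prev_high):
    """Stack the three guidance signals along the channel axis (assumed fusion)."""
    return np.concatenate(
        [upsample_nearest(f_low, SCALE), identity, prev_high], axis=-1
    )


def toy_noise_predictor(x_t, cond, alpha_bar):
    """Hypothetical stand-in for the trained denoiser.

    It treats the average of the three conditioning images as the clean
    frame and returns the noise that would explain x_t under that guess.
    """
    h, w, _ = cond.shape
    x0_guess = cond.reshape(h, w, 3, 3).mean(axis=2)  # average over the 3 images
    return (x_t - np.sqrt(alpha_bar) * x0_guess) / np.sqrt(1.0 - alpha_bar)


def enhance_video(low_frames, identity, steps=10, seed=0):
    """Generate high-resolution frames one at a time, feeding each back in."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.2, steps)   # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    prev_high = identity                    # assumption for the first frame
    outputs = []
    for f_low in low_frames:
        cond = build_condition(f_low, identity, prev_high)
        x = rng.standard_normal(identity.shape)       # start from pure noise
        for t in reversed(range(steps)):
            eps = toy_noise_predictor(x, cond, alpha_bars[t])
            # Standard DDPM posterior mean given the predicted noise.
            mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
            noise = rng.standard_normal(x.shape) if t > 0 else 0.0
            x = mean + np.sqrt(betas[t]) * noise
        prev_high = x                       # recurrence: reuse as next condition
        outputs.append(x)
    return np.stack(outputs)


if __name__ == "__main__":
    low_video = np.zeros((4, LOW_RES, LOW_RES, 3))    # placeholder 64x64 frames
    identity_img = np.ones((HIGH_RES, HIGH_RES, 3))   # placeholder neutral image
    print(enhance_video(low_video, identity_img).shape)
```

The point of the sketch is the data flow: each output frame becomes part of the conditioning for the next one, which is what ties consecutive frames together, while the identity image is supplied unchanged at every step as the source of appearance and background.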
