Towards Realistic Landmark-Guided Facial Video Inpainting Based on GANs
Fatemeh Ghorbani Lohesara, Karen Egiazarian, Sebastian Knorr
TL;DR
This paper tackles realistic facial video inpainting under both static and moving occlusions by introducing a GAN-based, expression-aware framework. The method uses facial landmarks and a single occlusion-free reference frame, combined with a Temporal Shift Module (in online or offline mode) and a facial expression recognition (FER) loss, to preserve identity and emotion across frames. On FaceForensics it outperforms LGTSM and CombCN both quantitatively and qualitatively, demonstrating robust occlusion removal and temporal coherence. The approach enables practical applications in video conferencing, telemedicine, and privacy-preserving video editing, with potential extensions to higher-resolution 2D and 3D volumetric video inpainting.
Abstract
Facial video inpainting plays a crucial role in a wide range of applications, including the removal of obstructions in video conferencing and telemedicine, enhancement of facial expression analysis, privacy protection, integration of graphical overlays, and virtual makeup. The task is challenging because facial features are intricate and humans are highly familiar with faces, which raises the bar for accurate and convincing completions. To address occlusion removal in this setting, we focus on generating complete frames from masked facial data while ensuring both spatial and temporal coherence. We introduce a network for expression-based video inpainting that employs generative adversarial networks (GANs) to handle static and moving occlusions across all frames. By using facial landmarks and an occlusion-free reference image, our model keeps the user's identity consistent across frames. We further improve the preservation of emotion through a customized facial expression recognition (FER) loss function, which encourages detailed inpainted outputs. The proposed framework adaptively removes occlusions from facial videos, whether they are static or moving across frames, while producing realistic and coherent results.
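
To make the online/offline distinction of the Temporal Shift Module mentioned in the TL;DR concrete, the following is a minimal NumPy sketch of the generic temporal shift operation, not the authors' implementation. The function name `temporal_shift`, the `fold_div` parameter, and the `(T, C, H, W)` tensor layout are assumptions chosen for illustration. In offline mode a fraction of the channels is shifted in both temporal directions, so each frame sees features from its past and future neighbors; in online mode only past features are shifted forward, so no future frame is needed.

```python
import numpy as np


def temporal_shift(x, fold_div=8, online=False):
    """Shift a fraction of channels along the time axis (illustrative sketch).

    x        : feature array of shape (T, C, H, W); this layout is an assumption.
    fold_div : 1/fold_div of the channels is shifted per direction (hypothetical name).
    online   : if True, shift only from past frames (no access to the future).
    """
    t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)  # frames without a neighbor are zero-padded

    if online:
        # Unidirectional: frame t receives the first `fold` channels of frame t-1.
        out[1:, :fold] = x[:-1, :fold]
        out[:, fold:] = x[:, fold:]
    else:
        # Bidirectional: first group comes from frame t+1, second group from frame t-1.
        out[:-1, :fold] = x[1:, :fold]
        out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]
        out[:, 2 * fold:] = x[:, 2 * fold:]
    return out


if __name__ == "__main__":
    # Toy features: 3 frames, 8 channels, 1x1 spatial size.
    feats = np.arange(3 * 8, dtype=np.float32).reshape(3, 8, 1, 1)
    print(temporal_shift(feats, fold_div=4, online=False)[:, :, 0, 0])
    print(temporal_shift(feats, fold_div=4, online=True)[:, :, 0, 0])
```

The shift itself adds no learnable parameters; it only moves existing features between neighboring frames, so the 2D convolutions that follow can mix temporal information. The online variant trades the look-ahead of the offline variant for the ability to run frame by frame, which suits live settings such as video conferencing.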
