Towards Realistic Landmark-Guided Facial Video Inpainting Based on GANs

Fatemeh Ghorbani Lohesara, Karen Egiazarian, Sebastian Knorr

TL;DR

This paper tackles realistic facial video inpainting under both static and moving occlusions by introducing a GAN-based, expression-aware framework. The method leverages facial landmarks and a single occlusion-free reference frame, combined with a Temporal Shift Module (online/offline) and an FER loss to preserve identity and emotions across frames. It achieves superior quantitative and qualitative results on FaceForensics compared to LGTSM and CombCN, demonstrating robust occlusion removal and temporal coherence. The approach enables practical applications in video conferencing, telemedicine, and privacy-preserving video editing, with potential extensions to higher-resolution 2D and 3D volumetric video inpainting.
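
To make the online/offline distinction of the Temporal Shift Module concrete, the following is a minimal sketch of a generic temporal shift over per-frame feature channels. It is not the authors' implementation. The function name `temporal_shift`, the `fold_div` parameter, the zero padding at sequence boundaries, and the nested-list feature layout are all assumptions made for illustration. In the offline (bidirectional) variant, one group of channels is taken from the previous frame and another from the next frame. In the online (unidirectional) variant, only past frames are used, so the shift can run on streaming video.

```python
from typing import List


def temporal_shift(frames: List[List[float]], fold_div: int = 8,
                   online: bool = False) -> List[List[float]]:
    """Shift a fraction of channels along the time axis.

    frames[t][c] is the (hypothetical) feature value of channel c at time t.
    The first C // fold_div channels are taken from the previous frame.
    In offline mode, the next C // fold_div channels are taken from the
    following frame. All remaining channels are left unchanged.
    Positions with no source frame are zero-padded (an assumption).
    """
    num_frames = len(frames)
    num_channels = len(frames[0])
    fold = num_channels // fold_div
    out = [list(f) for f in frames]  # copy so the input is not modified

    for t in range(num_frames):
        # Channels [0, fold): read from the past frame (t - 1).
        for c in range(fold):
            out[t][c] = frames[t - 1][c] if t > 0 else 0.0
        # Channels [fold, 2 * fold): read from the future frame (t + 1),
        # only when future frames are available (offline mode).
        if not online:
            for c in range(fold, 2 * fold):
                out[t][c] = frames[t + 1][c] if t < num_frames - 1 else 0.0
    return out


# Three frames with four channels each; fold_div=4 shifts one channel per direction.
features = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
offline_out = temporal_shift(features, fold_div=4, online=False)
online_out = temporal_shift(features, fold_div=4, online=True)
```

In this sketch the online variant never reads `frames[t + 1]`, which is what would allow frame-by-frame processing in a live setting such as video conferencing.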

Abstract

Facial video inpainting plays a crucial role in a wide range of applications, including the removal of obstructions in video conferencing and telemedicine, enhancement of facial expression analysis, privacy protection, integration of graphical overlays, and virtual makeup. The task is challenging because facial features are intricate and humans are highly familiar with faces, which raises the bar for accurate and convincing completions. To address occlusion removal in this context, we focus on generating complete images from facial data covered by masks while ensuring both spatial and temporal coherence. Our study introduces a network for expression-based video inpainting that employs generative adversarial networks (GANs) to handle static and moving occlusions across all frames. By using facial landmarks and an occlusion-free reference image, our model maintains the user's identity consistently across frames. We further enhance emotion preservation through a customized facial expression recognition (FER) loss function, which helps retain detail in the inpainted outputs. The proposed framework adaptively removes occlusions from facial videos, whether they are static or moving across frames, while providing realistic and coherent results.

Paper Structure

This paper contains 11 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Overview of the pipeline of the proposed GAN-based expression-aware inpainting with the support of facial landmarks and a single occlusion-free reference frame. The masked images and facial landmarks are provided as input to the generator (G) to synthesize the complete face images. The discriminator (D) then classifies generated faces as real or fake.
  • Figure 2: Sample inpainted frames from the FaceForensics validation set (ID 18) produced by our model, LGTSM, and CombCN, along with the corresponding input and ground-truth (GT) frames. The applied masks are static across frames. Images: RCN TV (https://www.youtube.com/watch?v=8ILvKPA3TI0)
  • Figure 3: Sample inpainted frames from the FaceForensics validation set (ID 73) produced by our model, LGTSM, and CombCN, along with the corresponding input and ground-truth (GT) frames. The masks vary from frame to frame. Images: MTV Lebanon News (https://www.youtube.com/watch?v=irbGBNQaZ1E)
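
The two mask settings in Figures 2 and 3 (static masks versus masks that vary across frames) can be illustrated with a small sketch. This is a hypothetical example rather than the masking procedure used in the paper. The rectangular mask shape, the constant horizontal motion, and the helper names `rect_mask`, `static_masks`, `moving_masks`, and `apply_mask` are all assumptions.

```python
from typing import List

Mask = List[List[int]]      # 1 marks an occluded pixel, 0 a visible one
Frame = List[List[float]]   # a single-channel frame as rows of pixel values


def rect_mask(h: int, w: int, top: int, left: int, mh: int, mw: int) -> Mask:
    """Build an h x w binary mask with an mh x mw occluded rectangle."""
    return [[1 if top <= y < top + mh and left <= x < left + mw else 0
             for x in range(w)] for y in range(h)]


def static_masks(n: int, h: int, w: int, top: int, left: int,
                 mh: int, mw: int) -> List[Mask]:
    """Same occlusion position in every one of the n frames."""
    return [rect_mask(h, w, top, left, mh, mw) for _ in range(n)]


def moving_masks(n: int, h: int, w: int, top: int, left: int,
                 mh: int, mw: int, dx: int = 1) -> List[Mask]:
    """Occlusion that moves dx pixels to the right per frame (clamped to the frame)."""
    masks = []
    for t in range(n):
        shifted_left = min(max(left + t * dx, 0), w - mw)
        masks.append(rect_mask(h, w, top, shifted_left, mh, mw))
    return masks


def apply_mask(frame: Frame, mask: Mask, fill: float = 0) -> Frame:
    """Replace occluded pixels with a fill value, leaving visible pixels untouched."""
    return [[fill if m else p for p, m in zip(frow, mrow)]
            for frow, mrow in zip(frame, mask)]


static = static_masks(3, 4, 4, 1, 1, 2, 2)
moving = moving_masks(3, 4, 6, 1, 0, 2, 2, dx=2)
masked = apply_mask([[5, 5], [5, 5]], [[1, 0], [0, 1]])
```

Masked frames built this way would correspond to the "input" rows in the figures, with the mask marking the region that the generator is expected to fill.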