Expression-aware video inpainting for HMD removal in XR applications

Fatemeh Ghorbani Lohesara, Karen Egiazarian, Sebastian Knorr

TL;DR

This work tackles the persistent challenge of HMD occlusion in social XR by introducing EVI-HRnet, a GAN-based, expression-aware video inpainting framework that uses facial landmarks and a single unoccluded reference frame to restore occluded upper-face content while preserving identity. A novel Facial Expression Recognition (FER) loss guides the generator to maintain authentic emotional cues across frames, enabling temporally consistent results suitable for teleconferencing and collaborative XR. Quantitative and qualitative evaluations on the FaceForensics dataset with synthetic HMD masks show that EVI-HRnet outperforms state-of-the-art baselines, with robust eye detail recovery and reduced temporal artifacts, particularly when FER loss and landmarks are employed. The approach offers a lightweight, hardware-free solution with practical impact for enhancing social XR experiences and could be extended with internal HMD cameras and 3D facial modeling to further improve robustness and realism.

Abstract

Head-mounted displays (HMDs) serve as indispensable devices for observing extended reality (XR) environments and virtual content. However, HMDs present an obstacle to external recording techniques as they block the upper face of the user. This limitation significantly affects social XR applications, specifically teleconferencing, where facial features and eye gaze information play a vital role in creating an immersive user experience. In this study, we propose a new network for expression-aware video inpainting for HMD removal (EVI-HRnet) based on generative adversarial networks (GANs). Our model effectively fills in missing information with regard to facial landmarks and a single occlusion-free reference image of the user. The framework and its components ensure the preservation of the user's identity across frames using the reference frame. To further improve the level of realism of the inpainted output, we introduce a novel facial expression recognition (FER) loss function for emotion preservation. Our results demonstrate the remarkable capability of the proposed framework to remove HMDs from facial videos while maintaining the subject's facial expression and identity. Moreover, the outputs exhibit temporal consistency along the inpainted frames. This lightweight framework presents a practical approach for HMD occlusion removal, with the potential to enhance various collaborative XR applications without the need for additional hardware.
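
The abstract names a FER loss for emotion preservation but does not give its formula here. The following is only a minimal sketch of one plausible form, assuming a hypothetical pretrained expression classifier that returns per-frame class logits: it penalizes the KL divergence between the expression distribution predicted on the ground-truth frames and the one predicted on the inpainted frames. The function names, the choice of KL divergence, and the toy logits are all assumptions for illustration, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of class logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fer_loss(logits_gt, logits_out, eps=1e-8):
    """Mean KL(p_gt || p_out) across frames.

    logits_gt / logits_out hold one list of expression logits per frame,
    as produced by a hypothetical FER classifier (an assumption; the
    paper's actual classifier and distance are not specified here).
    """
    if not logits_gt or len(logits_gt) != len(logits_out):
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for gt, out in zip(logits_gt, logits_out):
        p, q = softmax(gt), softmax(out)
        total += sum(
            pi * (math.log(pi + eps) - math.log(qi + eps))
            for pi, qi in zip(p, q)
        )
    return total / len(logits_gt)


# Toy logits for two frames and three expression classes (made-up values).
gt_logits = [[2.0, 0.1, -1.0], [0.3, 1.5, 0.0]]
matching = [[2.0, 0.1, -1.0], [0.3, 1.5, 0.0]]   # same expression as GT
drifted = [[-1.0, 0.1, 2.0], [1.5, 0.3, 0.0]]    # expression has changed

loss_same = fer_loss(gt_logits, matching)
loss_drift = fer_loss(gt_logits, drifted)
print(loss_same, loss_drift)
```

In a training loop, a term of this kind would typically be added to the generator objective with a weighting factor, so that inpainted frames whose predicted expression drifts from the ground truth are penalized alongside the reconstruction and adversarial terms.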

Paper Structure (23 sections, 8 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: The proposed framework of EVI-HRnet: (A) The overall inpainting network including a generator and a discriminator, (B) details of the attention-based LGTSM generator architecture, and (C) the components of the attention module (images: TSN (https://www.youtube.com/watch?v=7sSPBQvxImQ)).
  • Figure 2: Reference image (first frame) used for inpainting for a sequence (ID 78) in the FaceForensics testing set (image: MTV Lebanon News (https://www.youtube.com/watch?v=nZJJVq_Mfvg)).
  • Figure 3: Sample inpainted frames from the FaceForensics validation set (ID 78) produced by EVI-HRnet, EVI-HRnet without landmarks, and EVI-HRnet without the FER loss (images: MTV Lebanon News (https://www.youtube.com/watch?v=nZJJVq_Mfvg)).
  • Figure 4: Sample qualitative results on the FaceForensics validation set with HMD masks (ID 78), shown alongside the ground truth (GT) and inputs. The inpainted frames are taken from the results of EVI-HRnet, LGTSM, and CombCN (image: MTV Lebanon News (https://www.youtube.com/watch?v=nZJJVq_Mfvg)).
  • Figure 5: Examples of inpainted results when the reference image has closed eyes, for a sequence (ID 10) in the FaceForensics testing set. From left to right: reference frame, input, GT, and result from EVI-HRnet (image: Al Aan TV (https://www.youtube.com/watch?v=2iBnQpy8OMw)).