Face Mask Removal with Region-attentive Face Inpainting

Minmin Yang

TL;DR

This paper tackles the challenge of removing face masks and restoring masked facial regions without sacrificing identity. It introduces a two-stage framework combining a segmentation network for mask localization and an inpainting network that uses Multi-scale Channel-Spatial Attention Modules (M-CSAM) and region-focused supervision to achieve high-fidelity, identity-preserving reconstructions. The authors also synthesize the Masked-Faces dataset from CelebA with five mask types to evaluate their method, showing superior SSIM, PSNR, and $\ell_1$ results and improved face recognition performance over four baselines. The approach demonstrates practical potential for social interaction, video/image editing, and recognition tasks, while acknowledging a larger network size and dataset-driven limitations when applied to more naturalistic imagery.

Abstract

During the COVID-19 pandemic, face masks have become ubiquitous in our lives. Face masks can cause some face recognition models to fail since they cover a significant portion of a face. In addition, removing face masks from captured images or videos can be desirable, e.g., for better social interaction and for image/video editing and enhancement purposes. Hence, we propose a generative face inpainting method to effectively recover/reconstruct the masked part of a face. Face inpainting is more challenging than traditional inpainting, since it requires high fidelity while maintaining the identity at the same time. Our proposed method includes a Multi-scale Channel-Spatial Attention Module (M-CSAM) to mitigate the spatial information loss and learn the inter- and intra-channel correlation. In addition, we introduce an approach enforcing the supervised signal to focus on masked regions instead of the whole image. We also synthesize our own Masked-Faces dataset from the CelebA dataset by incorporating five different types of face masks, including surgical masks, regular masks, and scarves, which also cover the neck area. The experimental results show that our proposed method outperforms different baselines in terms of structural similarity index measure, peak signal-to-noise ratio and $\ell_1$ loss, while also providing better outputs qualitatively. The code will be made publicly available. Code is available at GitHub.
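The idea of focusing the supervised signal on masked regions can be illustrated with a region-restricted $\ell_1$ loss. The following is a minimal sketch, not the authors' implementation: the function names `masked_l1` and `full_l1`, the grayscale nested-list image format, and the binary mask convention (1 marks a pixel covered by the face mask) are all assumptions made for illustration.

```python
def masked_l1(pred, target, mask):
    """Mean absolute error over pixels where mask is 1 (hypothetical helper).

    pred, target: 2D lists of floats (grayscale images of the same shape).
    mask: 2D list of 0/1 values; 1 marks a pixel covered by the face mask.
    """
    total = 0.0
    count = 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += abs(p - t)
                count += 1
    # Avoid division by zero when the mask selects no pixels.
    return total / count if count else 0.0


def full_l1(pred, target):
    """Mean absolute error over every pixel, for comparison."""
    ones = [[1] * len(row) for row in pred]
    return masked_l1(pred, target, ones)


# Toy 2x2 example: the bottom row is the region hidden by the face mask.
target = [[0.0, 0.0], [1.0, 1.0]]
pred = [[0.0, 0.0], [0.5, 1.0]]
mask = [[0, 0], [1, 1]]

region_loss = masked_l1(pred, target, mask)  # averages over the 2 masked pixels
whole_loss = full_l1(pred, target)           # averages over all 4 pixels
```

In this toy case the single reconstruction error is averaged over two pixels rather than four, so the region-restricted loss is larger than the whole-image loss. Unmasked pixels that are trivially correct no longer dilute the error coming from the reconstructed region.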

Paper Structure (25 sections, 10 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: The main architecture of the proposed approach. The legend shows what each colored box refers to.
  • Figure 2: Illustration of how the Masked-Faces dataset is synthesized. Top row shows the five types of masks used. Last row shows the output of face detection with 21 facial landmarks. Middle row shows the masks aligned and placed with respect to the landmarks.
  • Figure 3: Qualitative comparison of models with and without CSAM. First row: input masked images; Second row: ground-truth images; Third row: results from the network with CSAM incorporated; Fourth row: results from the network without CSAM (best viewed when zoomed in).
  • Figure 4: Qualitative comparison. Different approaches are compared on the Masked-Faces dataset.
  • Figure 5: Qualitative comparison of local area supervision and supervision over the whole image. The first and second rows show the input masked images and the ground-truth images, respectively. The third and fourth rows show the results from local supervision and from full supervision, respectively.