Face Mask Removal with Region-attentive Face Inpainting
Minmin Yang
TL;DR
This paper tackles the challenge of removing face masks and restoring masked facial regions without sacrificing identity. It introduces a two-stage framework combining a segmentation network for mask localization and an inpainting network that uses Multi-scale Channel-Spatial Attention Modules (M-CSAM) and region-focused supervision to achieve high-fidelity, identity-preserving reconstructions. The authors also synthesize the Masked-Faces dataset from CelebA with five mask types to evaluate their method, showing superior SSIM, PSNR, and $\ell_1$ results and improved face recognition performance over four baselines. The approach demonstrates practical potential for social interaction, video/image editing, and recognition tasks, while acknowledging a larger network size and dataset-driven limitations when applied to more naturalistic imagery.
Abstract
During the COVID-19 pandemic, face masks have become ubiquitous in our lives. Face masks can cause some face recognition models to fail, since they cover a significant portion of the face. In addition, removing face masks from captured images or videos can be desirable, e.g., for better social interaction and for image/video editing and enhancement purposes. Hence, we propose a generative face inpainting method to effectively recover the masked part of a face. Face inpainting is more challenging than traditional inpainting, since it requires high fidelity while also preserving identity. Our proposed method includes a Multi-scale Channel-Spatial Attention Module (M-CSAM) to mitigate spatial information loss and learn the inter- and intra-channel correlation. In addition, we introduce an approach that forces the supervised signal to focus on masked regions instead of the whole image. We also synthesize our own Masked-Faces dataset from the CelebA dataset by incorporating five different types of face masks, including surgical masks, regular masks, and scarves, which also cover the neck area. The experimental results show that our proposed method outperforms different baselines in terms of structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and $\ell_1$ loss, while also providing qualitatively better outputs. The code is publicly available on GitHub.
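The idea of restricting supervision to the masked region, and two of the reported metrics, can be illustrated with a short sketch. This is not the authors' implementation: the function names `masked_l1` and `psnr`, the binary mask convention (1 inside the face-mask region, 0 elsewhere), and the normalization by the number of masked pixels are all assumptions made for illustration, since the abstract does not give the exact loss formulation.

```python
import numpy as np

def masked_l1(pred, target, mask, eps=1e-8):
    """Mean absolute error computed over masked pixels only.

    Assumption: `mask` is 1 where the face mask was and 0 elsewhere,
    and is broadcastable to the image shape.
    """
    mask = np.broadcast_to(mask, pred.shape).astype(np.float64)
    return float((np.abs(pred - target) * mask).sum() / (mask.sum() + eps))

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

# Toy example: a 4x4 "image" whose lower half is the masked region.
target = np.zeros((4, 4))
pred = np.zeros((4, 4))
pred[2:, :] = 0.5            # reconstruction error only inside the mask
mask = np.zeros((4, 4))
mask[2:, :] = 1.0

whole_image_l1 = float(np.abs(pred - target).mean())
region_l1 = masked_l1(pred, target, mask)
print(whole_image_l1, region_l1, psnr(pred, target))
```

In this toy case the whole-image average dilutes the error with the unmasked pixels that are already correct, whereas the region-restricted version measures only the area the inpainting network has to reconstruct.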
