Learning Position-Aware Implicit Neural Network for Real-World Face Inpainting

Bo Zhao, Huan Yang, Jianlong Fu

TL;DR

This work tackles real-world face inpainting where input shapes and resolutions vary widely, a scenario where prior methods falter in preserving position-sensitive facial structures. It introduces IN^2, an implicit neural inpainting network with a Downsample Processing Encoder, Neighbor Hybrid Attention Blocks, and an Implicit Neural Pyramid Decoder, plus an Adaptive Training Strategy to handle irregular shapes. By explicitly modeling position information through a coordinate-aware decoding pipeline, IN^2 achieves state-of-the-art results on CelebA-HQ in both ideal and real-world settings, notably improving eyes and mouth restoration under arbitrary aspect ratios. The approach demonstrates the practicality of integrating implicit neural representations into face inpainting, enabling robust high-resolution performance without restricting input shape and size.

Abstract

Face inpainting requires a model to have a precise global understanding of facial position structure. Benefiting from the powerful capabilities of deep learning backbones, recent face inpainting works have achieved decent performance in the ideal setting (square images at $512px$). However, existing methods often produce visually unpleasant results, especially in position-sensitive details (e.g., eyes and nose), when directly applied to arbitrary-shaped images in real-world scenarios. These artifacts indicate that existing methods fall short in processing position information. In this paper, we propose an Implicit Neural Inpainting Network (IN$^2$) that handles arbitrary-shaped face images in real-world scenarios by explicitly modeling position information. Specifically, a downsample processing encoder is proposed to reduce information loss while obtaining the global semantic feature. A neighbor hybrid attention block with a hybrid attention mechanism is proposed to improve the model's understanding of facial structure without restricting the shape of the input. Finally, an implicit neural pyramid decoder is introduced to explicitly model position information and bridge the gap between low-resolution features and high-resolution output. Extensive experiments demonstrate the superiority of the proposed method on the real-world face inpainting task.
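To make the idea of decoding with explicit position information concrete, here is a minimal NumPy sketch of coordinate-conditioned decoding in the general style of implicit neural representations. It is not the authors' implicit neural pyramid decoder: the function names `make_coord_grid` and `implicit_decode`, the nearest-cell feature lookup, the toy two-layer MLP, and all shapes are assumptions chosen for illustration. The point it shows is that the output size is set by the queried coordinates rather than by the shape of the low-resolution feature map.

```python
import numpy as np


def make_coord_grid(height, width):
    """Pixel-centre coordinates in (-1, 1), shape (height, width, 2) as (y, x)."""
    ys = (np.arange(height) + 0.5) / height * 2.0 - 1.0
    xs = (np.arange(width) + 0.5) / width * 2.0 - 1.0
    return np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1)


def implicit_decode(features, out_h, out_w, weights):
    """Decode a low-resolution feature map at an arbitrary output size.

    features: (h, w, C) array of encoder features (hypothetical).
    weights:  list of (W, b) pairs defining a toy MLP (hypothetical).
    """
    h, w, _ = features.shape
    query = make_coord_grid(out_h, out_w)  # (out_h, out_w, 2)

    # Index of the feature cell that contains each query coordinate.
    iy = np.clip(((query[..., 0] + 1.0) / 2.0 * h).astype(int), 0, h - 1)
    ix = np.clip(((query[..., 1] + 1.0) / 2.0 * w).astype(int), 0, w - 1)
    nearest = features[iy, ix]  # (out_h, out_w, C)

    # Explicit position signal: offset of the query from its cell centre.
    rel = query - make_coord_grid(h, w)[iy, ix]  # (out_h, out_w, 2)

    # The MLP sees both the local feature and the relative coordinate.
    x = np.concatenate([nearest, rel], axis=-1)
    for i, (W, b) in enumerate(weights):
        x = x @ W + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x  # (out_h, out_w, 3)


# Toy usage with random, untrained weights (assumed sizes).
rng = np.random.default_rng(0)
channels, hidden = 8, 16
features = rng.normal(size=(16, 12, channels))
weights = [
    (rng.normal(size=(channels + 2, hidden)) * 0.1, np.zeros(hidden)),
    (rng.normal(size=(hidden, 3)) * 0.1, np.zeros(3)),
]
tall = implicit_decode(features, 64, 48, weights)
wide = implicit_decode(features, 50, 70, weights)
```

The same feature map is decoded at two different sizes and aspect ratios without any change to the network, which is the property the abstract relies on when it describes bridging low-resolution features and high-resolution output.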

Paper Structure

This paper contains 31 sections, 8 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: An illustration of real-world face inpainting scenarios and a comparison between our proposed method and SOTA approaches. (a) Examples of real-world inpainting images with different aspect ratios and resolutions. (b) Our proposed method robustly processes inputs of arbitrary resolution (e.g., $1024px \times 768px$) down to fine structures, even when trained at around $512px \times 512px$. (c) Existing SOTA methods only work well on low-resolution images with a $1:1$ aspect ratio.
  • Figure 2: Qualitative results of LaMa under various configurations in a real-world setting ($1024px\times768px$). The original LaMa suffers a performance drop in the real-world setting, which can be mitigated by extra resize processing. However, the results remain unsatisfactory in position-sensitive details, such as eyes and mouth.
  • Figure 3: Overview of the proposed implicit neural inpainting network for the real-world face inpainting task. (a) The implicit neural inpainting network consists of a downsample processing encoder, an attention body, and an implicit neural pyramid decoder. (b) We propose a downsample processing block to replace the simple downsampling operation used in other methods for more efficient encoding. (c) NHAB overcomes the image-shape limitation of window-based attention and enables superior facial structure learning. (d) Our implicit neural representation block models position information explicitly for real-world tasks.
  • Figure 4: Qualitative comparison with different aspect ratios. We display zoomed-in results for easier comparison of position-sensitive details; the full results can be found in the appendix.
  • Figure 5: Qualitative comparison with different aspect ratios.