HINT: High-quality INpainting Transformer with Mask-Aware Encoding and Enhanced Attention

Shuang Chen, Amir Atapour-Abarghouei, Hubert P. H. Shum

TL;DR

HINT tackles the challenge of high-quality image inpainting under irregular masks by marrying a mask-aware downsampling scheme with an efficient spatially aware transformer. The MPD module preserves visible information during downsampling, while SCAL provides multiscale channel-spatial attention within a sandwich FFN–Attention–FFN block to capture long-range dependencies without prohibitive cost. The approach achieves state-of-the-art results on CelebA, CelebA-HQ, Places2, and Dunhuang, outperforming CNN-based and diffusion-model baselines in both fidelity and perceptual quality, and does so with competitive parameter counts and runtime. These components collectively enable robust, texture-rich reconstructions suitable for practical inpainting tasks and potentially other masking scenarios.

Abstract

Existing image inpainting methods leverage convolution-based downsampling approaches to reduce spatial dimensions. This may result in information loss from corrupted images where the available information is inherently sparse, especially when the missing regions are large. Recent advances in self-attention mechanisms within transformers have led to significant improvements in many computer vision tasks including inpainting. However, limited by the computational costs, existing methods cannot fully exploit the efficacy of long-range modelling capabilities of such models. In this paper, we propose an end-to-end High-quality INpainting Transformer, abbreviated as HINT, which consists of a novel mask-aware pixel-shuffle downsampling module (MPD) to preserve the visible information extracted from the corrupted image while maintaining the integrity of the information available for high-level inferences made within the model. Moreover, we propose a Spatially-activated Channel Attention Layer (SCAL), an efficient self-attention mechanism interpreting spatial awareness to model the corrupted image at multiple scales. To further enhance the effectiveness of SCAL, motivated by recent advances in speech recognition, we introduce a sandwich structure that places feed-forward networks before and after the SCAL module. We demonstrate the superior performance of HINT compared to contemporary state-of-the-art models on four datasets: CelebA, CelebA-HQ, Places2, and Dunhuang.
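The sandwich ordering described in the abstract (feed-forward, attention, feed-forward) can be illustrated with a minimal pure-Python sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the attention is a plain channel attention stand-in (it omits the spatial activation that SCAL adds), queries, keys and values are all the input itself, the scaling by the square root of the number of positions is arbitrary, the toy feed-forward has no learned weights, and the residual connections are assumed.

```python
import math

# Illustrative sketch only. A feature is a C x N list of lists:
# C channels, N = H*W flattened spatial positions.

def ffn(x, scale=0.5):
    """Toy position-wise feed-forward: ReLU then a fixed scaling (no learned weights)."""
    return [[scale * max(v, 0.0) for v in row] for row in x]


def channel_attention(x):
    """Plain channel attention (assumed stand-in, NOT SCAL).

    Q = K = V = x here; a real layer would use learned projections.
    The attention map is C x C, so its size does not grow with H*W.
    """
    n = len(x[0])
    out = []
    for q in x:
        # Similarity of this channel to every channel (arbitrary sqrt(n) scaling).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(n) for k in x]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]  # one row of the C x C map
        out.append([sum(w * x[j][p] for j, w in enumerate(weights))
                    for p in range(n)])
    return out


def add(x, y):
    """Element-wise sum of two C x N features."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def sandwich(x):
    """FFN -> attention -> FFN, each with an assumed residual connection."""
    x = add(x, ffn(x))
    x = add(x, channel_attention(x))
    x = add(x, ffn(x))
    return x


x = [[1.0, 0.0, -1.0, 2.0],
     [0.5, 0.5, 0.5, 0.5]]
y = sandwich(x)  # same C x N shape as the input
```

In this toy setting the attention map has C × C = 4 entries, whereas attention over the N = 4 spatial positions would need N × N = 16; the gap widens quickly as the resolution grows, which is the usual motivation for attending across channels.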

Paper Structure (17 sections, 9 equations, 9 figures, 11 tables)

This paper contains 17 sections, 9 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Comparisons with the state of the art (suvorov2022resolution, li2022mat, li2022misf) on different datasets (karras2017progressive, zhou2017places, yu2019dunhuang) with large masks (shown as white areas). Red boxes highlight major differences. The bottom two examples are from unseen real-world high-resolution images.
  • Figure 2: Overview of the proposed framework, which is built from a gated embedding block followed by multiple stacked "sandwiches" at different levels. The "sandwich" is described in Sec. \ref{sec:methodology:transformer_body:sandwich}, and the MPD is described in Sec. \ref{sec:methodology:mpd}.
  • Figure 3: Comparison of Pixel-shuffle Down-sampling (PD, upper) and the proposed Mask-aware Pixel-shuffle Down-sampling (MPD, lower). Our proposed MPD consists of one $3 \times 3$ convolution, a conventional PD, interlacing (concatenation of feature and mask slices), and a masked-separable convolution. Invalid pixel drifting happens in $\hat{X}$: after the feature $X'$ is downsampled, the masked position becomes inconsistent across channels.
  • Figure 4: "Sandwich" (right) and "Spatially-activated Channel Attention Layer" (left). "$\bigoplus$", "$\bigotimes$", and "$\bigodot$" denote the element-wise sum, matrix multiplication, and element-wise multiplication, respectively.
  • Figure 5: Comparisons with visualisations $(256 \times 256)$ showing that our results are more coherent in structure and sharper in texture and semantic details. The top two rows are from CelebA-HQ karras2017progressive and the bottom two rows are from Places2 zhou2017places.
  • ...and 4 more figures
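
The drift mentioned in the Figure 3 caption can be illustrated with a minimal pure-Python sketch of pixel-shuffle downsampling (space-to-depth) applied to a toy mask, followed by the interlacing of feature and mask slices. The function names, the 4 × 4 toy input, the downsampling factor of 2, and the convention that 1 marks a visible pixel are all assumptions made for illustration; the $3 \times 3$ convolution and the masked-separable convolution of MPD are not modelled here.

```python
# Illustrative sketch only: names and toy data are assumptions, not the paper's code.

def pixel_shuffle_down(x, r):
    """Rearrange an H x W grid into r*r slices of size (H/r) x (W/r)."""
    h, w = len(x), len(x[0])
    assert h % r == 0 and w % r == 0, "H and W must be divisible by r"
    slices = []
    for di in range(r):
        for dj in range(r):
            # Slice (di, dj) keeps every r-th pixel starting at offset (di, dj).
            slices.append([[x[i][j] for j in range(dj, w, r)]
                           for i in range(di, h, r)])
    return slices


def interlace(feature_slices, mask_slices):
    """Alternate feature and mask slices: [f0, m0, f1, m1, ...]."""
    out = []
    for f, m in zip(feature_slices, mask_slices):
        out.extend([f, m])
    return out


# Toy mask: 1 = visible, 0 = missing. Only the pixel at row 0, column 1 is missing.
mask = [[1, 0, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]]
feature = [[10 * i + j for j in range(4)] for i in range(4)]

mask_slices = pixel_shuffle_down(mask, 2)
feature_slices = pixel_shuffle_down(feature, 2)

# Validity of downsampled location (0, 0) in each of the four slices.
# The hole lands in slice 1 only, so the same location is valid in some
# channels and invalid in others.
valid_at_origin = [m[0][0] for m in mask_slices]

# Keeping each feature slice next to its own mask slice preserves the
# per-channel validity information for later mask-aware processing.
stacked = interlace(feature_slices, mask_slices)
```

Here `valid_at_origin` evaluates to `[1, 0, 1, 1]`: a single missing pixel makes one downsampled channel invalid at a location where the other three are valid, which is the cross-channel inconsistency that pairing every feature slice with its mask slice is meant to keep track of.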