TransRef: Multi-Scale Reference Embedding Transformer for Reference-Guided Image Inpainting

Taorong Liu, Liang Liao, Delin Chen, Jing Xiao, Zheng Wang, Chia-Wen Lin, Shin'ichi Satoh

TL;DR

This paper tackles reference-guided image inpainting for large, irregular holes by introducing TransRef, a multi-scale transformer framework that progressively embeds reference information through two modules: Ref-PA (reference-patch alignment, which aligns patch features and harmonizes style differences) and Ref-PT (reference-patch transformer, which refines the embedded reference features), so that the reference guidance stays coherent with the corrupted content. It combines a hierarchical encoder-decoder with a convolution tail and is trained with a joint loss comprising $\mathcal{L}_1$, perceptual, and style terms that target both pixel accuracy and perceptual quality. To support research in this area, the authors introduce DPED50K, a large open benchmark of 50K input-reference pairs for training and 2K for testing, derived via SIFT matching from real-world scenes. Experiments show that TransRef outperforms state-of-the-art methods across standard metrics, especially for large holes, and indicate promising applicability to object removal and to cloud removal in remote sensing. The work contributes a scalable transformer-based solution and a dataset intended to foster further development of reference-guided restoration.
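
The joint loss mentioned above can be illustrated with a minimal NumPy sketch. It assumes the common formulation in which the perceptual term compares feature maps and the style term compares their Gram matrices. The function names, the weights `w_l1`, `w_perc`, `w_style`, and the use of precomputed feature maps are all hypothetical placeholders; the summary does not state the paper's actual weights or feature extractor.

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a (C, H, W) feature map, normalized by C*H*W."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def joint_loss(pred, target, pred_feats, target_feats,
               w_l1=1.0, w_perc=0.1, w_style=250.0):
    """Weighted sum of L1, perceptual, and style terms.

    pred, target: images as arrays of equal shape.
    pred_feats, target_feats: lists of (C, H, W) feature maps, assumed to
    come from some fixed pretrained network (not modeled here).
    The default weights are illustrative, not taken from the paper.
    """
    # Pixel-level reconstruction term.
    l1 = np.abs(pred - target).mean()
    # Perceptual term: L1 distance between feature maps, summed over layers.
    perc = sum(np.abs(p - t).mean() for p, t in zip(pred_feats, target_feats))
    # Style term: L1 distance between Gram matrices, summed over layers.
    style = sum(np.abs(gram_matrix(p) - gram_matrix(t)).mean()
                for p, t in zip(pred_feats, target_feats))
    return w_l1 * l1 + w_perc * perc + w_style * style
```

In a real training loop the feature maps would be produced by a frozen network and the terms computed with an autodiff framework; the sketch only shows how the three terms combine.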

Abstract

Image inpainting for completing complicated semantic environments and diverse hole patterns of corrupted images is challenging even for state-of-the-art learning-based inpainting methods trained on large-scale data. A reference image capturing the same scene of a corrupted image offers informative guidance for completing the corrupted image as it shares similar texture and structure priors to that of the holes of the corrupted image. In this work, we propose a transformer-based encoder-decoder network, named TransRef, for reference-guided image inpainting. Specifically, the guidance is conducted progressively through a reference embedding procedure, in which the referencing features are subsequently aligned and fused with the features of the corrupted image. For precise utilization of the reference features for guidance, a reference-patch alignment (Ref-PA) module is proposed to align the patch features of the reference and corrupted images and harmonize their style differences, while a reference-patch transformer (Ref-PT) module is proposed to refine the embedded reference feature. Moreover, to facilitate the research of reference-guided image restoration tasks, we construct a publicly accessible benchmark dataset containing 50K pairs of input and reference images. Both quantitative and qualitative evaluations demonstrate the efficacy of the reference information and the proposed method over the state-of-the-art methods in completing complex holes. Code and dataset can be accessed at https://github.com/Cameltr/TransRef.
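
To make the idea of aligning reference patch features and harmonizing their style more concrete, here is a generic NumPy sketch. It is not the paper's Ref-PA definition: it assumes alignment can be approximated by scaled dot-product cross-attention (corrupted-image patches as queries, reference patches as keys and values) and harmonization by matching per-channel mean and standard deviation. The names `align_patches` and `harmonize` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def align_patches(main, ref):
    """Gather reference features for each corrupted-image patch.

    main: (N, D) patch features of the corrupted image (queries).
    ref:  (M, D) patch features of the reference image (keys and values).
    Returns (N, D) reference features re-arranged to match the main patches.
    """
    d = main.shape[1]
    attn = softmax(main @ ref.T / np.sqrt(d), axis=-1)  # (N, M), rows sum to 1
    return attn @ ref

def harmonize(aligned, main, eps=1e-5):
    """Shift and scale aligned features to the per-channel statistics of main.

    Assumption: style difference is modeled as a per-channel mean/std gap.
    """
    mu_a, sd_a = aligned.mean(axis=0), aligned.std(axis=0) + eps
    mu_m, sd_m = main.mean(axis=0), main.std(axis=0) + eps
    return (aligned - mu_a) / sd_a * sd_m + mu_m
```

A usage example under the same assumptions: `harmonize(align_patches(main, ref), main)` yields an `(N, D)` array that could then be fused with `main`, for instance by concatenation followed by a learned projection.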

Paper Structure (41 sections, 14 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Illustration of image inpainting models. (a) Standard context-based inpainting model; (b) Structure-guided inpainting model; (c) The proposed reference-guided inpainting model.
  • Figure 2: Application scenarios of reference-guided image inpainting: object removal (top and middle rows) and image completion (bottom row). Both require restoring corrupted scenes to their original state, with the reference images serving as an informative guide for faithfully restoring the missing content.
  • Figure 3: Overview of the proposed TransRef. The first row, composed of the overlap patch embedding and the Main-PT modules, forms the basic hierarchical inpainting framework. Reference guidance is applied at each scale through the reference embedding procedure, carried out by the Ref-PA and Ref-PT modules. Finally, the hierarchical features from the Main-PT module and the decoder features from the transformer decoder block are fed into a convolution tail to generate the completed image.
  • Figure 4: Illustration of the Ref-PA module. The Ref-PA consists of the PA block for patch alignment and the PH block for patch harmonization.
  • Figure 5: Illustration of (a) the Main-Patch Transformer (Main-PT) Module and (b) the Reference-Patch Transformer (Ref-PT) Module.
  • ...and 7 more figures