High-Fidelity Pluralistic Image Completion with Transformers

Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao

TL;DR

The paper tackles pluralistic image completion by decoupling global structure reconstruction from local texture refinement. It introduces a bi-directional transformer that models low-resolution appearance priors from masked inputs, using a discretized visual vocabulary and a masked language modeling (MLM) objective. Gibbs sampling from the transformer's output produces diverse priors, and a guided CNN upsampling stage then renders high-fidelity textures aligned to the observed image. This two-stage approach achieves higher fidelity, richer diversity, and strong generalization to large masks and large-scale datasets such as ImageNet, as validated by quantitative metrics, qualitative comparisons, and user studies. The work demonstrates that combining transformer-based global reasoning with CNN-based local detail synthesis can overcome the limitations of CNNs in global coherence while maintaining high-quality textures.
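The sampling step described above can be illustrated with a minimal, Gibbs-style sketch. It is not the authors' implementation: `toy_predict` is a hypothetical stand-in for the bi-directional transformer, `MASK` and `sample_prior` are names invented here, and the 1-D token list stands in for a flattened low-resolution image. The only point it makes is that drawing each missing token from a conditional distribution over a discrete vocabulary, rather than taking the most likely token, yields a different plausible prior for each random seed.

```python
import random

MASK = -1  # hypothetical sentinel marking a missing token


def toy_predict(tokens, position, vocab_size):
    """Stand-in for the transformer (assumption): returns a distribution over
    the vocabulary for one position, conditioned on all other known tokens."""
    weights = [1.0] * vocab_size  # smoothing so every token stays possible
    for i, tok in enumerate(tokens):
        if i != position and tok != MASK:
            # Bi-directional context: tokens on both sides vote, nearer ones more.
            weights[tok] += 4.0 / abs(i - position)
    total = sum(weights)
    return [w / total for w in weights]


def sample_prior(tokens, vocab_size, predict=toy_predict, seed=0):
    """Fill masked positions one at a time by sampling from the conditional."""
    rng = random.Random(seed)
    filled = list(tokens)
    masked_positions = [i for i, tok in enumerate(tokens) if tok == MASK]
    for pos in masked_positions:
        probs = predict(filled, pos, vocab_size)
        # Sampling (not argmax) is what makes the completion pluralistic.
        filled[pos] = rng.choices(range(vocab_size), weights=probs, k=1)[0]
    return filled


if __name__ == "__main__":
    observed = [3, 3, MASK, MASK, 5, 5]
    for seed in range(3):
        print(sample_prior(observed, vocab_size=8, seed=seed))
```

Observed tokens are never overwritten, and each newly sampled token becomes context for the positions filled after it.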

Abstract

Image completion has made tremendous progress with convolutional neural networks (CNNs) because of their powerful texture modeling capacity. However, due to some inherent properties (e.g., local inductive prior, spatially invariant kernels), CNNs neither understand global structures well nor naturally support pluralistic completion. Recently, transformers have demonstrated their power in modeling long-range relationships and generating diverse results, but their computational complexity is quadratic in the input length, which hampers their application to high-resolution images. This paper brings the best of both worlds to pluralistic image completion: appearance prior reconstruction with a transformer and texture replenishment with a CNN. The transformer recovers pluralistic coherent structures together with some coarse textures, while the CNN enhances the local texture details of the coarse priors, guided by the high-resolution masked images. The proposed method vastly outperforms state-of-the-art methods in three aspects: 1) a large performance boost in image fidelity, even compared to deterministic completion methods; 2) better diversity and higher fidelity for pluralistic completion; 3) exceptional generalization ability on large masks and generic datasets such as ImageNet.
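To make the quadratic-cost argument concrete, the following back-of-the-envelope sketch counts query-key pairs in full self-attention. The one-token-per-pixel assumption and the two resolutions (32×32 and 256×256) are illustrative choices made here, not figures taken from the abstract, and `attention_pairs` is a hypothetical helper name.

```python
def attention_pairs(height, width):
    """Query-key pairs in full self-attention, assuming one token per pixel."""
    n_tokens = height * width
    return n_tokens * n_tokens


low = attention_pairs(32, 32)      # 1,024 tokens
high = attention_pairs(256, 256)   # 65,536 tokens

print(low)          # 1048576
print(high)         # 4294967296
print(high // low)  # 4096
```

An 8× increase in side length multiplies the token count by 64 and the attention cost by 4,096, which is why the transformer is confined to a low-resolution prior and the upsampling is left to a CNN.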

Paper Structure

This paper contains 10 sections, 8 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Pluralistic free-form image completion results produced by our method.
  • Figure 2: Pipeline Overview. Our method consists of two networks. The upper one is a bi-directional transformer, which produces the probability distribution of the missing regions; diverse appearance priors are then reconstructed by sampling from this distribution. Subsequently, we employ a CNN to upsample the appearance prior to the original resolution under the guidance of the input masked images. Our method combines the advantages of both the transformer and the CNN, leading to high-fidelity pluralistic image completion performance. E: Encoder, D: Decoder, R: Residual block.
  • Figure 3: Differences between single-directional (left) and bi-directional (right) attention.
  • Figure 4: Qualitative comparison with state-of-the-art methods on the FFHQ and Places2 datasets. The completion results of our method show better quality and diversity.
  • Figure 5: Qualitative comparison with state-of-the-art methods on the ImageNet dataset. More qualitative examples are shown in the supplementary materials.
  • ...and 5 more figures