High-Fidelity Pluralistic Image Completion with Transformers
Ziyu Wan, Jingbo Zhang, Dongdong Chen, Jing Liao
TL;DR
The paper tackles pluralistic image completion by decoupling global structure reconstruction from local texture refinement. In the first stage, a bi-directional transformer models low-resolution appearance priors from masked inputs, using a discretized visual vocabulary and a masked language modeling (MLM) objective; Gibbs sampling over the masked tokens then produces diverse priors. In the second stage, a guided CNN upsamples each prior and renders high-fidelity textures aligned to the observed image. Compared with prior methods, this two-stage approach achieves higher fidelity, richer diversity, and strong generalization to large masks and large-scale datasets such as ImageNet, as validated through quantitative metrics, qualitative comparisons, and user studies. The work demonstrates that combining transformer-based global reasoning with CNN-based local detail synthesis can overcome the limitations of CNNs in global coherence while maintaining high-quality textures.
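The sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_token_logits` is a hypothetical stand-in for the bi-directional transformer, and the single left-to-right sweep, the vocabulary size, and the `temperature` parameter are all assumptions made for the example. It shows that each masked token is drawn from a distribution conditioned on the observed tokens and on the tokens already sampled, so different random seeds can yield different low-resolution priors.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_token_logits(tokens, position, vocab_size):
    """Hypothetical stand-in for the transformer (assumption, not the paper's model).

    Scores each vocabulary entry by its closeness to the mean of the
    already-filled neighbors of `position`.
    """
    neighbors = [
        tokens[i]
        for i in (position - 1, position + 1)
        if 0 <= i < len(tokens) and tokens[i] is not None
    ]
    if not neighbors:
        return [0.0] * vocab_size  # no context yet: uniform distribution
    center = sum(neighbors) / len(neighbors)
    return [-abs(v - center) for v in range(vocab_size)]


def gibbs_complete(tokens, mask, vocab_size, rng, temperature=1.0):
    """Sample masked tokens one at a time, conditioning on everything filled so far."""
    result = [None if is_masked else tok for tok, is_masked in zip(tokens, mask)]
    for pos in range(len(result)):
        if not mask[pos]:
            continue  # observed tokens are kept unchanged
        probs = softmax(toy_token_logits(result, pos, vocab_size), temperature)
        result[pos] = rng.choices(range(vocab_size), weights=probs, k=1)[0]
    return result


# Usage: the same masked input completed under three different seeds.
tokens = [2, 2, 0, 0, 0, 0, 5, 5]
mask = [False, False, True, True, True, True, False, False]
for seed in range(3):
    print(gibbs_complete(tokens, mask, vocab_size=8, rng=random.Random(seed)))
```

In a real system the logits would come from the trained transformer evaluated on the full token grid; the sketch keeps only the sampling loop.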
Abstract
Image completion has made tremendous progress with convolutional neural networks (CNNs) because of their powerful texture modeling capacity. However, due to some inherent properties (e.g., local inductive priors and spatially invariant kernels), CNNs neither perform well in understanding global structures nor naturally support pluralistic completion. Recently, transformers have demonstrated their power in modeling long-range relationships and generating diverse results, but their computational complexity is quadratic in the input length, which hampers their application to high-resolution images. This paper brings the best of both worlds to pluralistic image completion: appearance prior reconstruction with a transformer and texture replenishment with a CNN. The transformer recovers pluralistic coherent structures together with some coarse textures, while the CNN enhances the local texture details of the coarse priors, guided by the high-resolution masked images. The proposed method vastly outperforms state-of-the-art methods in three respects: 1) a large performance boost in image fidelity, even compared to deterministic completion methods; 2) better diversity and higher fidelity for pluralistic completion; 3) exceptional generalization ability on large masks and generic datasets such as ImageNet.
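To make the quadratic-cost argument concrete, the short calculation below counts the query-key pairs that full self-attention evaluates over a grid of tokens. The two resolutions and the one-token-per-grid-cell assumption are illustrative choices, not figures taken from the paper.

```python
def attention_pairs(height, width):
    """Query-key pairs in full self-attention over a height x width token grid."""
    num_tokens = height * width
    return num_tokens * num_tokens


low_res = attention_pairs(32, 32)      # 1,024 tokens  -> 1,048,576 pairs
high_res = attention_pairs(256, 256)   # 65,536 tokens -> 4,294,967,296 pairs
ratio = high_res // low_res            # 4096

print(low_res, high_res, ratio)
```

Under these assumptions, an 8x increase in side length multiplies the attention cost by 4096, which is why reasoning at low resolution with a transformer and delegating texture synthesis to a CNN is attractive.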
