RetCompletion: High-Speed Inference Image Completion with Retentive Network
Yueyang Cang, Pingge Hu, Xiaoteng Zhang, Xingtong Wang, Yuhang Liu, Li Shi
TL;DR
RetCompletion tackles the slow inference of pluralistic image completion by adapting RetNet to vision in a two-stage pipeline. The first stage introduces Bi-RetNet, which fuses bidirectional context to produce coherent low-resolution priors and uses pixel-wise inference for fast updates; the second stage applies CNN-based guided upsampling to restore texture. On ImageNet and CelebA-HQ, inference is about 10$\times$ faster than ICT and 15$\times$ faster than RePaint while quality remains competitive, as supported by quantitative metrics and a user study. This work broadens the applicability of RetNet in computer vision and offers a practical route to fast pluralistic inpainting.
Abstract
Time cost is a major challenge in achieving high-quality pluralistic image completion. The Retentive Network (RetNet), recently introduced in natural language processing, offers a new approach to this problem through its low-cost inference. Inspired by this, we apply RetNet to the pluralistic image completion task in computer vision. We present RetCompletion, a two-stage framework. In the first stage, we introduce Bi-RetNet, a bidirectional sequence information fusion model that integrates contextual information from images. During inference, we employ a unidirectional pixel-wise update strategy to restore consistent image structures, achieving both high reconstruction quality and fast inference speed. In the second stage, a CNN upsamples the low-resolution result to enhance texture details. Experiments on ImageNet and CelebA-HQ demonstrate that our inference is 10$\times$ faster than ICT and 15$\times$ faster than RePaint. RetCompletion thus significantly improves inference speed while delivering strong completion quality.
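The low-cost inference attributed to RetNet comes from the fact that retention can be evaluated either in parallel over a whole sequence or recurrently, one token at a time, with a fixed-size state. The sketch below illustrates that equivalence for a single simplified retention head in plain Python. It is not the Bi-RetNet implementation described here: the function names, the decay value `gamma=0.9`, and the toy dimensions are assumptions chosen for illustration, and details of the full RetNet layer (position rotation, normalization, multiple decay scales) are omitted.

```python
import random


def retention_parallel(Q, K, V, gamma):
    """Parallel form: O_n = sum over m <= n of gamma^(n-m) * (Q_n . K_m) * V_m."""
    d_v = len(V[0])
    out = []
    for n in range(len(Q)):
        o = [0.0] * d_v
        for m in range(n + 1):  # causal: only positions up to n contribute
            score = sum(q * k for q, k in zip(Q[n], K[m])) * gamma ** (n - m)
            for j in range(d_v):
                o[j] += score * V[m][j]
        out.append(o)
    return out


def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + K_n^T V_n, then O_n = Q_n S_n."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # fixed-size state, independent of length
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = gamma * S[i][j] + k[i] * v[j]
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


def random_matrix(rows, cols, rng):
    """Toy input: rows x cols matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
Q, K, V = (random_matrix(6, 4, rng) for _ in range(3))  # 6 tokens, width 4
par = retention_parallel(Q, K, V, gamma=0.9)
rec = retention_recurrent(Q, K, V, gamma=0.9)

# Largest elementwise gap between the two forms; it should be at rounding level.
max_diff = max(abs(a - b) for ro, rr in zip(par, rec) for a, b in zip(ro, rr))
```

In the recurrent form each new token costs a constant amount of work and memory regardless of how many tokens came before, which is the property a pixel-wise update strategy can exploit when pixels are treated as sequence elements.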
