RetCompletion: High-Speed Inference Image Completion with Retentive Network

Yueyang Cang, Pingge Hu, Xiaoteng Zhang, Xingtong Wang, Yuhang Liu, Li Shi

TL;DR

RetCompletion tackles the bottleneck of slow inference in pluralistic image completion by adapting RetNet to vision with a two-stage pipeline. It introduces Bi-RetNet to fuse bidirectional context for coherent low-resolution priors and employs pixel-wise inference to enable rapid updates, followed by CNN-based guided upsampling for texture. On ImageNet and CelebA-HQ, it delivers substantial speedups over ICT and RePaint while maintaining competitive quality, validated by quantitative metrics and a user study. This work broadens the applicability of RetNet in computer vision and offers a practical solution for real-time pluralistic inpainting.

Abstract

Time cost is a major challenge in achieving high-quality pluralistic image completion. The Retentive Network (RetNet), recently introduced in natural language processing, offers a new approach to this problem through its low-cost inference. Inspired by this, we apply RetNet to the pluralistic image completion task in computer vision. We present RetCompletion, a two-stage framework. In the first stage, we introduce Bi-RetNet, a bidirectional sequence information fusion model that integrates contextual information from images. During inference, we employ a unidirectional pixel-wise update strategy to restore consistent image structures, achieving both high reconstruction quality and fast inference speed. In the second stage, we use a CNN to upsample the low-resolution result and enhance texture details. Experiments on ImageNet and CelebA-HQ demonstrate that our inference is 10$\times$ faster than ICT and 15$\times$ faster than RePaint. The proposed RetCompletion significantly improves inference speed and delivers strong performance.

Paper Structure

This paper contains 22 sections, 10 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Pipeline Overview. Our method consists of two networks, which are trained separately. Based on the Bi-RetNet, the first network is employed for completing low-dimensional images. A parallel representation is utilized during training, predicting all pixels simultaneously to expedite the training process. In contrast, during inference, a recurrent representation is employed, predicting one pixel at a time to enhance the quality of the generated image. The second network, built on a CNN architecture, comprises an encoder, a decoder, and multiple residual blocks. Its primary function is to restore high-dimensional images from their low-dimensional counterparts.
  • Figure 2: Sample images for user study.
  • Figure 3: Comparison of user study results and inference time.
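
The Figure 1 caption contrasts a parallel representation used for training with a recurrent representation used for inference. The sketch below illustrates why the two can be interchanged, using the standard single-head retention formulation from the original RetNet work in a simplified form. It is not the authors' Bi-RetNet: the position rotation, normalization, multi-scale decay, and bidirectional fusion are all omitted, and the function names, the decay value `gamma`, and the toy dimensions are assumptions chosen for illustration. Pure-Python lists are used to keep it dependency-free; a real implementation would compute the closed form as a masked matrix product in a tensor library.

```python
import random

def retention_closed_form(Q, K, V, gamma):
    """Closed form behind the parallel representation:
    out[n] = sum over m <= n of gamma**(n - m) * (Q[n] . K[m]) * V[m]."""
    out = []
    for n in range(len(Q)):
        row = [0.0] * len(V[0])
        for m in range(n + 1):  # causal: only positions m <= n contribute
            weight = gamma ** (n - m) * sum(q * k for q, k in zip(Q[n], K[m]))
            row = [r + weight * v for r, v in zip(row, V[m])]
        out.append(row)
    return out

def retention_recurrent(Q, K, V, gamma):
    """Recurrent representation: a fixed-size state S is updated once per step.
    S_n = gamma * S_{n-1} + K[n]^T V[n];  out[n] = Q[n] S_n."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        # Decay the previous state and add the outer product of k and v.
        S = [[gamma * S[i][j] + k[i] * v[j] for j in range(d_v)]
             for i in range(d_k)]
        out.append([sum(q[i] * S[i][j] for i in range(d_k))
                    for j in range(d_v)])
    return out

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Toy sequence: 16 "pixels", each with 4-dimensional query/key/value vectors.
rng = random.Random(0)
Q, K, V = (random_matrix(16, 4, rng) for _ in range(3))
gamma = 0.9  # assumed decay factor, for illustration only

diff = max_abs_diff(retention_closed_form(Q, K, V, gamma),
                    retention_recurrent(Q, K, V, gamma))
print("outputs agree:", diff < 1e-9)
```

In the recurrent form each new position costs a constant amount of work and memory, because only the state `S` is carried forward. This is the property that makes one-pixel-at-a-time inference inexpensive, while the closed form lets all positions be computed together during training.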