Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling

Aditya Chaudhary, Prachet Dev Singh, Ankit Jha

TL;DR

This work tackles 4× single-image super-resolution with a ViT-based architecture trained in two stages: self-supervised colorization to learn rich visual features, then supervised residual SR fine-tuning to recover high-frequency details. By predicting a high-frequency residual on top of a bicubically upsampled input, the model achieves strong structural fidelity, attaining SSIM of 0.712 and PSNR of 22.90 dB on DIV2K. The results show that colorization pretraining provides a robust starting point, outperforming several state-of-the-art methods in perceptual quality while maintaining competitive PSNR. The study highlights the promise of self-supervised pretraining for complex image restoration tasks and suggests directions for scaling in future work.

Abstract

Single Image Super-Resolution (SISR) remains a difficult problem in computer vision. We present ViT-SR, a method that improves the performance of a Vision Transformer (ViT) through a two-stage training strategy. In the first stage, the model learns rich, generalizable visual representations from the data itself through self-supervised pretraining on a colorization task. In the second stage, the pretrained model is fine-tuned for 4× super-resolution. The model predicts a high-frequency residual image that is added to an initial bicubic interpolation of the input, a design that simplifies the learning problem. ViT-SR, trained and evaluated on the DIV2K benchmark dataset, achieves an SSIM of 0.712 and a PSNR of 22.90 dB. These results demonstrate the efficacy of the two-stage approach and highlight the potential of self-supervised pretraining for complex image restoration tasks. Further improvements may be possible with larger ViT architectures or alternative pretext tasks.
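
For readers unfamiliar with the PSNR figure quoted above, the following is a minimal sketch of the conventional definition, PSNR = 10 · log10(peak² / MSE). It assumes 8-bit intensities (peak 255) and single-channel images stored as nested lists. The function name `psnr` and these conventions are illustrative assumptions; the evaluation protocol actually used for ViT-SR (colour space, border cropping) is not stated in this summary.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images.

    Assumption: images are nested lists of pixel intensities on a
    0..peak scale. This is the textbook definition, not necessarily
    the exact protocol used to evaluate ViT-SR.
    """
    flat_ref = [p for row in reference for p in row]
    flat_est = [p for row in estimate for p in row]
    if not flat_ref or len(flat_ref) != len(flat_est):
        raise ValueError("images must be non-empty and the same size")

    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_est)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

Under the same assumption of a 255 peak, a PSNR of 22.90 dB corresponds to a root-mean-square error of roughly 18 intensity levels.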

Paper Structure

This paper contains 7 sections, 1 equation, 1 figure, 1 table.

Figures (1)

  • Figure 1: ViT-SR receives a bicubically upsampled low-resolution image and passes it through a ViT encoder–decoder architecture to predict a residual, which is added to the input to produce the high-resolution output. The model is first pre-trained on image colorization and then fine-tuned for SR.
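
The residual formulation in the Figure 1 caption can be sketched as follows: the output is the upsampled input plus a predicted residual. This is only an illustration of that composition step. The names `residual_super_resolve` and `predict_residual` are hypothetical, the predictor stands in for the ViT encoder–decoder (which is not reproduced here), and the assumptions of single-channel images in [0, 1] and clipping the result to that range are not taken from the paper.

```python
from typing import Callable, List

# Assumption: a single-channel image as nested lists, values in [0, 1].
Image = List[List[float]]


def residual_super_resolve(
    upsampled: Image,
    predict_residual: Callable[[Image], Image],
) -> Image:
    """Compose the output as upsampled input + predicted residual.

    `predict_residual` is a hypothetical stand-in for the trained
    network; it must return an image of the same shape as its input.
    """
    residual = predict_residual(upsampled)
    if len(residual) != len(upsampled) or any(
        len(r_row) != len(u_row) for r_row, u_row in zip(residual, upsampled)
    ):
        raise ValueError("residual must match the shape of the upsampled input")

    # Add the residual pixel by pixel and clip to the valid range.
    return [
        [min(1.0, max(0.0, u + r)) for u, r in zip(u_row, r_row)]
        for u_row, r_row in zip(upsampled, residual)
    ]


def zero_residual(image: Image) -> Image:
    """Trivial predictor: with a zero residual the output equals the input."""
    return [[0.0 for _ in row] for row in image]
```

With `zero_residual` as the predictor, the function returns the upsampled input unchanged, so the network only has to model the high-frequency detail that interpolation misses.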