Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling
Aditya Chaudhary, Prachet Dev Singh, Ankit Jha
TL;DR
This work tackles 4× single-image super-resolution (SISR) with a Vision Transformer (ViT) trained in two stages: self-supervised colorization pretraining to learn rich visual features, followed by supervised fine-tuning for residual super-resolution to recover high-frequency details. The model predicts a high-frequency residual that is added to a bicubically upsampled input, and it reaches an SSIM of 0.712 and a PSNR of 22.90 dB on DIV2K. The results show that colorization pretraining provides a robust starting point, outperforming several state-of-the-art methods in perceptual quality while maintaining competitive PSNR. The study highlights the promise of self-supervised pretraining for complex image restoration tasks and points to larger models and alternative pretext tasks as directions for future work.
Abstract
Single Image Super-Resolution (SISR) remains a difficult problem in computer vision. We present ViT-SR, a technique that improves the performance of a Vision Transformer (ViT) through a two-stage training strategy. In the first stage, the model is pretrained on a self-supervised colorization task, from which it learns rich, generalizable visual representations from the data itself. In the second stage, the pretrained model is fine-tuned for 4× super-resolution. The model predicts a high-frequency residual image that is added to an initial bicubic interpolation of the input, which reduces the task to residual learning. Trained and evaluated on the DIV2K benchmark dataset, ViT-SR achieves an SSIM of 0.712 and a PSNR of 22.90 dB. These results demonstrate the efficacy of our two-stage approach and highlight the potential of self-supervised pretraining for complex image restoration tasks. Further improvements may be possible with larger ViT architectures or alternative pretext tasks.
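The residual formulation and the PSNR figure quoted above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. All function names are hypothetical, images are plain nested lists of grayscale values in [0, 1], and nearest-neighbor upsampling stands in for bicubic interpolation to keep the example short. The "oracle" residual is computed from the ground truth only to show how the pieces fit together; in ViT-SR that residual would be predicted by the network.

```python
import math


def upsample_nearest(img, scale):
    """Upsample by repeating each pixel scale x scale times.

    Assumption: this is a simple stand-in for the bicubic interpolation
    described in the abstract, not the interpolation the paper uses.
    """
    out = []
    for row in img:
        wide = [p for p in row for _ in range(scale)]
        out.extend(list(wide) for _ in range(scale))
    return out


def residual_target(hr, base):
    """Residual the model would be trained to predict: HR minus upsampled LR."""
    return [[h - b for h, b in zip(hr_row, b_row)] for hr_row, b_row in zip(hr, base)]


def add_residual(base, residual):
    """SR estimate = upsampled input + residual, clipped to the range [0, 1]."""
    return [
        [min(1.0, max(0.0, b + r)) for b, r in zip(b_row, r_row)]
        for b_row, r_row in zip(base, residual)
    ]


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    n = sum(len(row) for row in target)
    mse = sum(
        (p - t) ** 2
        for p_row, t_row in zip(pred, target)
        for p, t in zip(p_row, t_row)
    ) / n
    return float("inf") if mse == 0 else 10 * math.log10(max_val ** 2 / mse)


def rmse_from_psnr(psnr_db, max_val=255.0):
    """Invert the PSNR formula to get the root-mean-square pixel error."""
    return max_val / (10 ** (psnr_db / 20))


# Toy data: a 1x2 low-resolution image and a 4x8 high-resolution target.
lr = [[0.2, 0.8]]
hr = [[0.1 + 0.1 * c + 0.01 * r for c in range(8)] for r in range(4)]

base = upsample_nearest(lr, 4)        # 4x upsampled input, shape 4x8
oracle = residual_target(hr, base)    # what a perfect model would output
sr = add_residual(base, oracle)       # reconstructs hr up to float rounding

baseline_db = psnr(base, hr)          # quality of upsampling alone
```

Under the PSNR definition used here, `rmse_from_psnr(22.90)` gives the root-mean-square error on an 8-bit scale that corresponds to the reported 22.90 dB, which is roughly 18 intensity levels out of 255.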
