VOSR: A Vision-Only Generative Model for Image Super-Resolution

Rongyuan Wu, Lingchen Sun, Zhengqiang Zhang, Xiangtao Kong, Jixin Zhao, Shihao Wang, Lei Zhang

Abstract

Most recent generative image super-resolution (SR) methods rely on adapting large text-to-image (T2I) diffusion models pretrained on web-scale text-image data. While effective, this paradigm starts from a generic T2I generator, even though SR is fundamentally a low-resolution (LR) input-conditioned image restoration task. In this work, we investigate whether an SR model trained purely on visual data can rival T2I-based ones. To this end, we propose VOSR, a Vision-Only generative framework for SR. We first extract semantically rich and spatially grounded features from the LR input using a pretrained vision encoder, and use them as visual semantic guidance. We then revisit classifier-free guidance for training generative models and show that the standard unconditional branch is ill-suited to restoration models trained from scratch. We therefore replace it with a restoration-oriented guidance strategy that preserves weak LR anchors. Built upon these designs, we first train a multi-step VOSR model from scratch and then distill it into a one-step model for efficient inference. VOSR requires less than one-tenth of the training cost of representative T2I-based SR methods, yet, in both multi-step and one-step settings, it achieves competitive or even better perceptual quality and efficiency while producing more faithful structures with fewer hallucinations on both synthetic and real-world benchmarks. Our results show, for the first time, that high-quality generative SR can be achieved without multimodal pretraining. The code and models can be found at https://github.com/cswry/VOSR.
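To make the guidance design concrete, here is a minimal PyTorch-style sketch of the restoration-oriented guidance described in the abstract, assuming a velocity-prediction denoiser. All names (guided_velocity, denoiser, struct, sem, anchor_strength) are illustrative placeholders rather than VOSR's actual API, and the exact weakening scheme and combination weights are assumptions.

```python
import torch

@torch.no_grad()
def guided_velocity(denoiser, x_t, t, lr_latent, sem_tokens,
                    scale=2.0, anchor_strength=0.3):
    """Restoration-oriented guidance (hypothetical sketch).

    Unlike standard classifier-free guidance, the second branch is not
    fully unconditional: visual semantics are dropped, but a weakened
    copy of the LR structural latent is kept as an anchor.
    """
    # Fully conditioned branch: LR structural latent + visual semantics.
    v_cond = denoiser(x_t, t, struct=lr_latent, sem=sem_tokens)

    # Restoration-oriented branch: semantics removed, weak LR anchor kept.
    v_anchor = denoiser(x_t, t, struct=anchor_strength * lr_latent, sem=None)

    # CFG-style extrapolation around the input-anchored branch.
    return v_anchor + scale * (v_cond - v_anchor)
```

In this sketch, scale = 1 reduces to the fully conditioned prediction, while larger scales push the output away from the weak LR anchor and toward more generative restoration.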

Paper Structure

This paper contains 22 sections, 17 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Comparison of VOSR with existing generative SR methods in terms of performance, efficiency, and training cost. Blue/orange colors denote T2I-based and vision-only methods; circles/triangles denote multi-step and one-step models. Performance is measured on RealSR [realsr], and efficiency is measured at $512\times512$ resolution using official repositories. VOSR achieves competitive or better perceptual quality than many T2I-based SR methods in both multi-step and one-step settings, while clearly outperforming prior vision-only methods. Its multi-step variant is substantially more efficient than existing T2I-based methods, and its one-step variant remains comparable to recent one-step T2I systems. Measured by the total number of training pixels consumed, VOSR requires only about one-tenth of the training cost of representative T2I-based SR methods. For fairness, we count only the pretraining cost of the core diffusion modules; reused components such as the VAE and semantic encoders are excluded.
  • Figure 2: Overview of VOSR. (a) Framework overview. Given an LR image, VOSR builds two complementary conditions from the input: a spatially aligned structural condition in the VAE latent space and a high-level visual semantic condition extracted by a pretrained vision encoder. These conditions are injected into a diffusion transformer to predict the denoising velocity for HR reconstruction. (b) Condition and guidance design. Compared with prior vision-only SR methods, which rely mainly on structural-only conditioning, VOSR introduces an additional visual semantic condition to reduce semantic ambiguity in restoration. Moreover, instead of using a fully unconditional branch as in standard classifier-free guidance, VOSR adopts a restoration-oriented guidance that removes semantic guidance while retaining weakened LR structural cues, making the guidance input-anchored and better suited to restoration. (A minimal sketch of this conditioning scheme is given after this figure list.)
  • Figure 3: Multi-step (top) and one-step (bottom) SR visual comparison on RealDeg [chen2025faithdiff] (Cf/0020.png) and ScreenSR (010.png).
  • Figure 4: Effect of the guidance scale on VOSR-1.4B-ms. As the scale increases, LPIPS improves while MUSIQ drops, indicating a shift from more generative outputs toward restoration that is more faithful to the LR input.
  • Figure 5: Thumbnail montage of the ScreenSR benchmark. The selected 130 GT images cover diverse scenarios, including indoor and outdoor scenes, humans, animals, plants, artworks, and multilingual text, with substantial variation in object and scene scales. This diversity ensures a comprehensive evaluation of generative SR methods in terms of semantic coverage, structural fidelity, and robustness across different content types.
  • ...and 2 more figures
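As referenced in the Figure 2 caption, the following is a minimal sketch of how the two conditions could be built from the LR input, assuming a frozen VAE encoder and a frozen pretrained vision encoder. build_conditions, vae.encode, and vision_encoder are hypothetical placeholders, not the released code's interfaces.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_conditions(lr_image, vae, vision_encoder, target_hw=(512, 512)):
    # Upsample the LR image to the target resolution so both conditions
    # are spatially aligned with the HR latent being denoised.
    lr_up = F.interpolate(lr_image, size=target_hw,
                          mode="bicubic", align_corners=False)

    # Structural condition: the LR content mapped into the VAE latent space.
    struct_cond = vae.encode(lr_up)        # placeholder encoder call

    # Semantic condition: high-level tokens from a pretrained vision encoder,
    # used to resolve content that is ambiguous at low resolution.
    sem_cond = vision_encoder(lr_up)       # placeholder encoder call

    return struct_cond, sem_cond
```

Both conditions are then injected into the diffusion transformer, as shown in Figure 2(a).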