Visual Autoregressive Modeling for Image Super-Resolution

Yunpeng Qu, Kun Yuan, Jinhua Hao, Kai Zhao, Qizhi Xie, Ming Sun, Chao Zhou

TL;DR

This work addresses the ill-posed nature of image super-resolution by introducing VARSR, a visual autoregressive framework that generates HR images through next-scale prediction. It combines Prefix Tokens for efficient LR conditioning, Scale-aligned RoPE to preserve 2D spatial structure across scales, a lightweight Diffusion Refiner to model quantization residuals, and an Image-based Classifier-Free Guidance mechanism to enhance realism. A large-scale high-quality dataset (≈4M images) and a staged training pipeline (C2I pretraining followed by ISR finetuning) underpin robust generative priors and strong performance. Empirical results show VARSR achieves high fidelity and realism with significantly improved efficiency relative to diffusion-based methods, along with strong real-world performance and ablations that validate each component.

Abstract

Image Super-Resolution (ISR) has seen significant progress with the introduction of generative models. However, challenges such as the trade-off between fidelity and realism, as well as computational complexity, have limited their application. Building on the success of autoregressive models in the language domain, we propose VARSR, a novel visual autoregressive modeling framework for ISR in the form of next-scale prediction. To effectively integrate and preserve semantic information in low-resolution images, we propose using prefix tokens to incorporate the condition. Scale-aligned Rotary Positional Encodings are introduced to capture spatial structures, and a diffusion refiner is used to model the quantization residual loss to achieve pixel-level fidelity. Image-based Classifier-free Guidance is proposed to guide the generation of more realistic images. Furthermore, we collect large-scale data and design a training process to obtain robust generative priors. Quantitative and qualitative results show that VARSR is capable of generating high-fidelity and high-realism images with more efficiency than diffusion-based methods. Our code will be released at https://github.com/qyp2000/VARSR.
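To make the guidance idea in the abstract concrete, here is a minimal sketch of the common classifier-free guidance rule applied to token logits. It is not taken from the VARSR code: the function names `guided_logits` and `softmax`, the `scale` parameter, and the toy logits are all assumptions for illustration, and how VARSR constructs its image-based negative branch is not described in this excerpt.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def guided_logits(cond_logits, neg_logits, scale):
    # Generic classifier-free guidance: start from the negative (or
    # unconditional) branch and move toward the conditional branch.
    # scale = 0 returns the negative branch, scale = 1 the conditional one,
    # and scale > 1 extrapolates past the conditional branch.
    return neg_logits + scale * (cond_logits - neg_logits)

# Toy logits over a 3-entry codebook (hypothetical values).
cond = np.array([2.0, 0.5, -1.0])   # branch with the desired condition
neg = np.array([1.0, 1.0, 0.0])     # branch with the negative condition
probs = softmax(guided_logits(cond, neg, scale=2.0))
```

With `scale` above 1, tokens that the conditional branch favours over the negative branch gain probability mass, which is the usual mechanism by which guidance trades diversity for perceived quality.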

Paper Structure

This paper contains 54 sections, 12 equations, 17 figures, 14 tables.

Figures (17)

  • Figure 1: The VARSR framework, which can be divided into three parts: (1) the LR image is encoded as Prefix Tokens that serve as the condition; (2) VAR generates discrete tokens via next-scale prediction; (3) the Diffusion Refiner predicts the continuous tokens as quantization residuals.
  • Figure 2: Internal structure of the autoregressive transformer. SA-RoPE represents the spatial structure. Quality-centered control generates offsets for autoregression.
  • Figure 3: Image tokenization process of VQVAE. The quantizer converts the image latent to multi-scale discrete tokens while representing the quantization loss as continuous tokens.
  • Figure 4: Limitations of current full-reference metrics (e.g., PSNR, SSIM, LPIPS). VARSR generates images that humans judge to be of higher perceptual quality, yet it lags behind on certain metrics.
  • Figure 5: Qualitative comparisons with different SOTA methods. Zoom in for a better view.
  • ...and 12 more figures
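The tokenization described in the Figure 3 caption can be sketched roughly as follows. This is a simplified illustration, assuming block-average downsampling, nearest-neighbour upsampling, and a random codebook; the function name `multiscale_residual_quantize`, the scale schedule, and all shapes are hypothetical and are not taken from the VARSR implementation.

```python
import numpy as np

def multiscale_residual_quantize(latent, codebook, scales):
    """Quantize a latent map coarse-to-fine.

    latent:   (H, W, C) continuous feature map
    codebook: (K, C) code vectors
    scales:   increasing side lengths, each dividing H and W
    Returns per-scale token maps, the summed reconstruction, and the
    continuous residual that the discrete tokens fail to capture.
    """
    H, W, C = latent.shape
    residual = latent.astype(np.float64).copy()
    recon = np.zeros_like(residual)
    tokens = []
    for s in scales:
        fh, fw = H // s, W // s
        # Downsample the current residual to s x s by block averaging.
        small = residual.reshape(s, fh, s, fw, C).mean(axis=(1, 3))
        # Pick the nearest codebook entry for every position.
        flat = small.reshape(-1, C)
        dist = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = dist.argmin(axis=1)
        tokens.append(idx.reshape(s, s))
        # Upsample the chosen code vectors back to H x W and subtract them.
        q = codebook[idx].reshape(s, s, C)
        up = np.repeat(np.repeat(q, fh, axis=0), fw, axis=1)
        recon += up
        residual -= up
    return tokens, recon, residual

rng = np.random.default_rng(0)
latent = rng.normal(size=(8, 8, 4))
codebook = rng.normal(size=(16, 4))
tokens, recon, residual = multiscale_residual_quantize(latent, codebook, (1, 2, 4, 8))
```

In this sketch the discrete token maps are what an autoregressive model would predict scale by scale, while `residual` plays the role of the continuous quantization residual that, per the Figure 1 caption, the Diffusion Refiner is meant to predict.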