NSARM: Next-Scale Autoregressive Modeling for Robust Real-World Image Super-Resolution

Xiangtao Kong, Rongyuan Wu, Shuaizheng Liu, Lingchen Sun, Lei Zhang

TL;DR

NSARM tackles Real-ISR by combining a bitwise next-scale autoregressive model with a transformation network that maps LR inputs to preliminary scales. It is trained in two stages, with the transformation network trained first and the full model then fine-tuned end to end, which preserves Infinity’s generative priors while shaping a robust LR-to-HR generation pathway. The approach achieves superior perceptual quality and robustness across diverse degradations, delivers faster inference than diffusion-based methods, and avoids the artifacts common in diffusion-based methods that keep the pre-trained model fixed and train only add-on modules. This work demonstrates the viability of pure autoregressive models for high-quality, robust Real-ISR and offers a scalable framework for leveraging large pre-trained priors in low-level vision tasks.

Abstract

Most recent real-world image super-resolution (Real-ISR) methods employ pre-trained text-to-image (T2I) diffusion models to synthesize the high-quality image either from random Gaussian noise, which yields realistic results but is slow due to iterative denoising, or directly from the input low-quality image, which is efficient but comes at the price of lower output quality. These approaches train ControlNet or LoRA modules while keeping the pre-trained model fixed, which often introduces over-enhanced artifacts and hallucinations and leaves them with poor robustness to inputs of varying degradations. Recent visual autoregressive (AR) models, such as pre-trained Infinity, can provide strong T2I generation capabilities while offering superior efficiency by using the bitwise next-scale prediction strategy. Building upon next-scale prediction, we introduce a robust Real-ISR framework, namely Next-Scale Autoregressive Modeling (NSARM). Specifically, we train NSARM in two stages: a transformation network is first trained to map the input low-quality image to preliminary scales, followed by end-to-end full-model fine-tuning. Such comprehensive fine-tuning enhances the robustness of NSARM in Real-ISR tasks without compromising its generative capability. Extensive quantitative and qualitative evaluations demonstrate that, as a pure AR model, NSARM achieves superior visual results over existing Real-ISR methods while maintaining a fast inference speed. Most importantly, it demonstrates much higher robustness to the quality of input images, showing stronger generalization performance. Project page: https://github.com/Xiangtaokong/NSARM
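
To make the idea of "scales" concrete, the following is a toy 1-D sketch of the coarse-to-fine residual decomposition used by VAR-like models, on which next-scale prediction is built. It is an illustration under simplifying assumptions, not the NSARM or Infinity implementation: it omits the bitwise quantizer, the learned tokenizer and the transformer, and the helper names (`downsample`, `upsample`, `decompose`, `reconstruct`) are hypothetical.

```python
def downsample(signal, size):
    """Average-pool a 1-D signal to `size` entries (len(signal) must be divisible by size)."""
    step = len(signal) // size
    return [sum(signal[i * step:(i + 1) * step]) / step for i in range(size)]


def upsample(signal, size):
    """Nearest-neighbour upsample a 1-D signal to `size` entries."""
    step = size // len(signal)
    return [value for value in signal for _ in range(step)]


def decompose(signal, scales):
    """Split `signal` into coarse-to-fine residual maps, one map per scale."""
    residual = list(signal)
    maps = []
    for size in scales:
        coarse = downsample(residual, size)   # what this scale can represent
        maps.append(coarse)
        contribution = upsample(coarse, len(signal))
        # The next, finer scale only has to explain what is still missing.
        residual = [r - c for r, c in zip(residual, contribution)]
    return maps


def reconstruct(maps, length):
    """Sum the upsampled maps; using only the first few gives a coarse approximation."""
    total = [0.0] * length
    for scale_map in maps:
        total = [t + u for t, u in zip(total, upsample(scale_map, length))]
    return total


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
maps = decompose(signal, scales=[1, 2, 4, 8])
coarse_only = reconstruct(maps[:2], len(signal))  # stand-in for "preliminary scales"
full = reconstruct(maps, len(signal))             # all scales recover the signal
```

In this toy setting, the first few maps play the role of the preliminary scales: they fix the coarse content, and the remaining finer maps add detail. In NSARM, per the abstract, the preliminary scales come from a transformation network applied to the low-quality input, and the autoregressive model generates the subsequent scales.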

Paper Structure

This paper contains 20 sections, 7 equations, 17 figures, 4 tables, 2 algorithms.

Figures (17)

  • Figure 1: Top two rows: failure cases of existing Real-ISR methods on which our NSARM still works. Bottom row: sorted distributions of TOPIQ scores of competing methods on the RealSR and RP60 datasets. The quality curves of existing methods fall sharply in the late portion, indicating failure cases. For some methods, more than 10% of the cases can fail. Our method demonstrates significantly better robustness than existing methods.
  • Figure 2: The image decomposition process of VAR-like methods (left) and the framework of our proposed NSARM (right).
  • Figure 3: The top and bottom images are the generation results of Infinity with the $k$ scales replaced by clear or blurred images. NSARM aims to establish a pathway towards the desired HQ output by deriving the preliminary scales from the LR input.
  • Figure 4: The model robustness performance. The curves show the sorted average scores over CLIPIQA, MUSIQ, MANIQA (divided by 100) and TOPIQ for different methods on the RealSR, RP60 and DIV2K datasets. The table in each sub-figure lists the numbers of Deficient, Poor and Collapse cases for each method.
  • Figure 5: Visual comparison of different methods on RP60 (no GT), RealSR and DIV2K datasets (zoom in for a better view).
  • ...and 12 more figures
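
The sorted quality curves described in the captions of Figures 1 and 4 can be sketched as follows: average the no-reference metrics of each output image, then sort the images from best to worst so that failure cases show up as a sharp drop at the tail. This is a minimal sketch with made-up scores, not the paper's evaluation code. The function names, the threshold of 0.4 and the choice to rescale MUSIQ by 100 are assumptions for illustration; the caption only states that a division by 100 is applied, and the cut-offs for the Deficient, Poor and Collapse cases are not given in this section.

```python
from statistics import mean


def sorted_quality_curve(per_image_scores, scale=None):
    """Average each image's metrics (after optional rescaling), sorted best to worst."""
    scale = scale or {}
    averages = [
        mean(value / scale.get(name, 1.0) for name, value in scores.items())
        for scores in per_image_scores
    ]
    return sorted(averages, reverse=True)


def count_below(curve, threshold):
    """Count images whose average score falls below a (hypothetical) threshold."""
    return sum(1 for score in curve if score < threshold)


# Made-up scores for three output images of one method.
toy_scores = [
    {"CLIPIQA": 0.70, "MUSIQ": 70.0, "MANIQA": 0.60, "TOPIQ": 0.70},
    {"CLIPIQA": 0.20, "MUSIQ": 30.0, "MANIQA": 0.20, "TOPIQ": 0.30},
    {"CLIPIQA": 0.60, "MUSIQ": 60.0, "MANIQA": 0.50, "TOPIQ": 0.70},
]

# Assumption: MUSIQ is the metric on a 0 to 100 range, so it is divided by 100.
curve = sorted_quality_curve(toy_scores, scale={"MUSIQ": 100.0})
failures = count_below(curve, threshold=0.4)  # 0.4 is an illustrative cut-off
```

Plotting `curve` against the image rank for each method gives one line per method; a method with a flatter tail has fewer low-quality outputs.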