Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment

Bryan Sangwoo Kim, Jeongsol Kim, Jong Chul Ye

TL;DR

The paper tackles the scalability gap in single-image super-resolution by introducing Chain-of-Zoom (CoZ), a scale-level autoregressive framework that uses intermediate scale-states ${\bm{x}}_0,\dots,{\bm{x}}_n$ and latent prompts ${\bm{c}}_i$ to decompose the low-to-high-resolution mapping into tractable steps. It combines a pretrained SR backbone with multi-scale-aware prompts from a Vision-Language Model (VLM), and improves prompt quality by fine-tuning the prompt extractor with GRPO against a critic VLM, aligning the text guidance with human preference. The approach is model-agnostic and demonstrates extreme magnifications up to $256\times$ using a $4\times$ SR backbone while maintaining perceptual quality. This allows existing SR systems to be extended to far higher magnifications without retraining the backbone, with practical implications for high-fidelity zooming in photography, medical imaging, and scientific visualization.

Abstract

Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). The prompt extractor itself is fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic VLM, aligning text guidance towards human preference. Experiments show that a standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement with high perceptual quality and fidelity. Project Page: https://bryanswkim.github.io/chain-of-zoom/ .
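The prompt-and-upscale cycle described in the abstract can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `chain_of_zoom`, `num_zoom_steps`, `sr_step`, and `make_prompt` are hypothetical names, the SR backbone is replaced by nearest-neighbor upscaling, and the prompt extractor is a stub that sees the two most recent scale-states.

```python
from typing import Callable, List, Optional, Tuple

Image = List[List[float]]  # toy grayscale image as nested lists (assumption)


def num_zoom_steps(target_scale: int, backbone_scale: int) -> int:
    """Count how many backbone applications reach target_scale exactly."""
    steps, scale = 0, 1
    while scale < target_scale:
        scale *= backbone_scale
        steps += 1
    if scale != target_scale:
        raise ValueError("target_scale must be a power of backbone_scale")
    return steps


def chain_of_zoom(
    x0: Image,
    sr_step: Callable[[Image, str], Image],
    make_prompt: Callable[[Image, Optional[Image]], str],
    n_steps: int,
) -> Tuple[List[Image], List[str]]:
    """Re-use one SR backbone n_steps times, generating a prompt per step."""
    states: List[Image] = [x0]
    prompts: List[str] = []
    for _ in range(n_steps):
        prev = states[-1]
        prev2 = states[-2] if len(states) > 1 else None
        # Hypothetical prompt extractor: conditions on the last two scale-states.
        prompt = make_prompt(prev, prev2)
        # Hypothetical backbone call: image plus prompt gives the next scale-state.
        states.append(sr_step(prev, prompt))
        prompts.append(prompt)
    return states, prompts


def nearest_upscale(img: Image, factor: int) -> Image:
    """Toy stand-in for a learned SR model: repeat each pixel factor times."""
    return [
        [v for v in row for _ in range(factor)]
        for row in img
        for _ in range(factor)
    ]


# Toy usage: a 1x1 "image" zoomed 256x with a 4x stand-in backbone.
steps = num_zoom_steps(256, 4)
states, prompts = chain_of_zoom(
    [[0.5]],
    sr_step=lambda img, prompt: nearest_upscale(img, 4),
    make_prompt=lambda prev, prev2: f"detail at width {len(prev[0])}",
    n_steps=steps,
)
```

With a $4\times$ backbone, reaching $256\times$ takes four applications because $4^4 = 256$. In a real pipeline, `sr_step` would wrap a diffusion SR model and `make_prompt` a VLM.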

Paper Structure

This paper contains 24 sections, 2 theorems, 11 equations, 15 figures, 2 tables, 2 algorithms.

Key Result

Proposition 1

Given a sequence of scale-states ${\bm{x}}_i$ that follows an AR-2 structure and latent variables ${\bm{c}}_i$ that satisfy Eq. (eq:latent), the joint distribution is expressed as

Figures (15)

  • Figure 1: Extreme super-resolution of photorealistic images by CoZ with up to 64× magnification (top) and 256× magnification (bottom). Fine details such as textures on a wall, wrinkles on a flag, and leaf veins are clearly seen.
  • Figure 2: (a) Conventional SR. When an SR backbone trained for a fixed up-scale factor (e.g., 4$\times$) is pushed to much larger magnifications beyond its training regime, blur and artifacts are produced. (b) Chain-of-Zoom (ours). Starting from an LR input, a pretrained VLM generates a descriptive prompt, which—together with the image—is fed to the same SR backbone to yield the next HR scale-state. This prompt-and-upscale cycle is repeated, allowing a single off-the-shelf model to climb to extreme resolutions (16$\times$–256$\times$) while preserving sharp detail and semantic fidelity.
  • Figure 3: Significance of proposed multi-scale-aware prompts: (a) Null prompt: coarse structure is retained, but high-frequency details are smoothed out. (b) DAPE prompt: inserting text from a degradation-aware prompt extractor (DAPE) helps, yet the images lack intricate detail at large magnifications. (c) VLM-generated prompts (ours): multi-scale prompts extracted by a VLM steer the SR backbone to synthesize realistic textures and crisp details.
  • Figure 4: GRPO Training Framework. At every zoom step, multi-scale image crops are fed to the base VLM, which generates candidate prompts after perceiving input images. A critic VLM scores the prompt for semantic quality, while phrase-exclusion and repetition penalties enforce conciseness and relevance. The weighted sum of these rewards forms the GRPO signal that iteratively fine-tunes the base VLM, steering it towards prompts that best guide extreme-scale super-resolution.
  • Figure 5: Qualitative Results. For each input image, super-resolution is performed at different magnifications with various methods: (a) Nearest neighbor interpolation; (b) One-step direct SR with the backbone SR model; (c-e) Variants of CoZ with different text prompts. The CoZ framework shows significantly better performance at large magnifications. Furthermore, after preference alignment with GRPO, the VLM prompts used in CoZ help the SR model generate realistic details without hallucinations.
  • ...and 10 more figures
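The reward described in the Figure 4 caption (a critic score combined with phrase-exclusion and repetition penalties as a weighted sum) can be sketched as below. This is only an illustration: the weights, the banned phrase list, the n-gram form of the repetition penalty, and every function name are assumptions, and the critic VLM is replaced by a plain callable.

```python
from typing import Callable, Sequence


def repetition_penalty(prompt: str, n: int = 2) -> float:
    """Fraction of word n-grams that are repeats, in [0, 1] (assumed form)."""
    words = prompt.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def phrase_exclusion_penalty(prompt: str, banned: Sequence[str]) -> float:
    """1.0 if any banned phrase occurs in the prompt, else 0.0 (assumed form)."""
    text = prompt.lower()
    return 1.0 if any(phrase.lower() in text for phrase in banned) else 0.0


def total_reward(
    prompt: str,
    critic_score: Callable[[str], float],
    banned: Sequence[str],
    w_critic: float = 1.0,   # hypothetical weight
    w_phrase: float = 0.5,   # hypothetical weight
    w_rep: float = 0.5,      # hypothetical weight
) -> float:
    """Weighted sum: critic score minus the two penalties."""
    return (
        w_critic * critic_score(prompt)
        - w_phrase * phrase_exclusion_penalty(prompt, banned)
        - w_rep * repetition_penalty(prompt)
    )


# Toy usage with a constant stand-in for the critic VLM.
reward = total_reward(
    "leaf veins",
    critic_score=lambda p: 0.8,
    banned=["first image"],
)
```

Here the penalties are subtracted so that a higher total always means a better prompt; the paper may define the terms and their signs differently.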

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 1
  • proof