Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment

Bryan Sangwoo Kim; Jeongsol Kim; Jong Chul Ye

TL;DR

The paper tackles the scalability gap in single-image super-resolution (SR) by introducing Chain-of-Zoom (CoZ), a scale-level autoregressive framework that uses intermediate scale-states ${\bm{x}}_0,\dots,{\bm{x}}_n$ and latent prompts ${\bm{c}}_i$ to decompose the low-to-high-resolution mapping into tractable steps. It combines a pretrained SR backbone with multi-scale-aware prompts from a vision-language model (VLM), and it improves prompt quality through a GRPO-based RLHF pipeline that aligns the prompts with human preferences. The approach is model-agnostic and demonstrates extreme magnifications of $256\times$ and beyond using a $4\times$ SR backbone while maintaining perceptual quality. Existing SR systems can therefore be extended to far higher magnifications without retraining, with practical implications for high-fidelity zooming in photography, medical imaging, and scientific visualization.
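
As a rough illustration of the decomposition, the chain can be written as a product of per-step terms. The first-order (Markov) form below is a simplifying assumption made here for readability; the paper's own factorization may condition each step on more than the immediately preceding scale-state.

$$
p({\bm{x}}_1, {\bm{c}}_1, \dots, {\bm{x}}_n, {\bm{c}}_n \mid {\bm{x}}_0) = \prod_{i=1}^{n} p({\bm{c}}_i \mid {\bm{x}}_{i-1})\, p({\bm{x}}_i \mid {\bm{x}}_{i-1}, {\bm{c}}_i)
$$

If each step magnifies by a factor $s$, then $n$ steps magnify by $s^n$. With a $4\times$ backbone, four steps already give $4^4 = 256\times$.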

Abstract

Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). The prompt extractor itself is fine-tuned using Group Relative Policy Optimization (GRPO) with a critic VLM, aligning text guidance towards human preference. Experiments show that a standard $4\times$ diffusion SR model wrapped in CoZ attains enlargement beyond $256\times$ with high perceptual quality and fidelity. Project Page: https://bryanswkim.github.io/chain-of-zoom/
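
The control flow of the chain described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `extract_prompt`, `sr_backbone`, and `chain_of_zoom` are hypothetical names, the "VLM" only reports the image shapes it was shown, and the "SR model" is plain nearest-neighbour upsampling that ignores its prompt. Passing the previous scale-state to the prompt extractor is likewise an assumption about what "multi-scale-aware" means.

```python
import numpy as np

SCALE = 4  # per-step factor of the backbone (the abstract describes a 4x model)


def extract_prompt(current, previous):
    """Hypothetical stand-in for the VLM prompt extractor.

    A real extractor would return descriptive text; this one only reports
    the shapes it was shown so the control flow can be followed.
    """
    coarse = "none" if previous is None else f"{previous.shape[0]}x{previous.shape[1]}"
    return f"current={current.shape[0]}x{current.shape[1]}, previous={coarse}"


def sr_backbone(image, prompt, scale=SCALE):
    """Hypothetical stand-in for a pretrained SR model.

    Nearest-neighbour upsampling; the prompt is accepted but ignored.
    """
    return np.kron(image, np.ones((scale, scale), dtype=image.dtype))


def chain_of_zoom(x0, n_steps, scale=SCALE):
    """Apply the same backbone n_steps times, producing x_0, ..., x_n."""
    states, prompts = [x0], []
    for _ in range(n_steps):
        previous = states[-2] if len(states) > 1 else None
        prompt = extract_prompt(states[-1], previous)            # c_i
        states.append(sr_backbone(states[-1], prompt, scale))    # x_i
        prompts.append(prompt)
    return states, prompts


# Four steps of a 4x backbone on a toy 2x2 "image".
x0 = np.arange(4, dtype=np.float32).reshape(2, 2)
states, prompts = chain_of_zoom(x0, n_steps=4)
total_magnification = states[-1].shape[0] // x0.shape[0]
```

Here `total_magnification` is the per-side enlargement after four steps, which is $4^4 = 256$. The backbone is never retrained; only the number of chain steps changes.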
