RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought

Junbo Qiao, Miaomiao Cai, Wei Li, Xudong Huang, Jie Hu, Xinghao Chen, Shaohui Lin, Hongkai Xiong

TL;DR

The paper tackles RealSR by addressing the critical challenge of understanding degraded content to enable faithful restoration. It introduces VLCoT, a vision-language Chain-of-Thought framework that progressively refines semantic understanding and image reconstruction in a coarse-to-fine manner. The approach is extended with VLCoT-GRPO, a multi-modal reinforcement learning paradigm that uses four reward signals to align vision and language reasoning and improve realism and fidelity. Empirical results show that RealSR-R1 achieves superior perceptual quality and stronger semantic alignment on real-world and synthetic benchmarks, demonstrating the potential of reasoning-guided cross-modal restoration for real-world image super-resolution.

Abstract

Real-World Image Super-Resolution (RealSR) is one of the most challenging tasks in image restoration. Existing methods struggle to understand degraded image content accurately, leading to reconstructed results that are both low-fidelity and unnatural. In this work we present RealSR-R1, which empowers RealSR models with understanding and reasoning capabilities. Inspired by the success of Chain of Thought (CoT) in large language models (LLMs), we simulate the human process of handling degraded images and propose the VLCoT framework, which integrates vision and language reasoning. The framework aims to precisely restore image details by progressively generating more comprehensive text and higher-resolution images. To overcome the failure of CoT trained with traditional supervised learning to generalize to real-world scenarios, we introduce, for the first time, Group Relative Policy Optimization (GRPO) into the Real-World Image Super-Resolution task. We propose VLCoT-GRPO as a solution, which designs four reward functions: (1) Format reward, used to standardize the CoT process; (2) Degradation reward, to incentivize accurate degradation estimation; (3) Understanding reward, to ensure the accuracy of the generated content; and (4) Generation reward, where we propose using a visual expert model to evaluate the quality of generated images, encouraging the model to generate more realistic images. Extensive experiments demonstrate that our proposed RealSR-R1 can generate realistic details and accurately understand image content, particularly in semantically rich scenes or images with severe degradation.
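
As a rough illustration of how four reward terms might feed a GRPO-style update, the sketch below sums per-sample rewards and standardizes them within a group of sampled outputs, which is the group-relative advantage that GRPO is generally described as using. This is not the authors' implementation: the equal weights, the reward values, the function names, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev

# Hypothetical equal weights; the abstract does not state how the
# four rewards are combined.
WEIGHTS = {
    "format": 1.0,
    "degradation": 1.0,
    "understanding": 1.0,
    "generation": 1.0,
}


def total_reward(rewards, weights=WEIGHTS):
    """Weighted sum of the four per-sample reward terms."""
    return sum(weights[name] * rewards[name] for name in weights)


def group_relative_advantages(totals, eps=1e-8):
    """Standardize each sample's reward against its own group.

    Samples that score above the group mean get a positive advantage,
    samples below it get a negative one. eps guards against a zero
    standard deviation when all rewards in the group are equal.
    """
    mu = mean(totals)
    sigma = pstdev(totals)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in totals]


# Made-up reward values for a group of three sampled outputs
# produced from the same low-resolution input.
group = [
    {"format": 1.0, "degradation": 0.5, "understanding": 0.8, "generation": 0.6},
    {"format": 1.0, "degradation": 0.9, "understanding": 0.9, "generation": 0.8},
    {"format": 0.0, "degradation": 0.2, "understanding": 0.4, "generation": 0.3},
]

totals = [total_reward(r) for r in group]
advantages = group_relative_advantages(totals)
```

Because the advantages are computed relative to the group, no separate value network is needed to estimate a baseline; the other samples drawn for the same input play that role.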

Paper Structure

This paper contains 15 sections, 6 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: (a) The GAN and diffusion methods themselves do not possess the capability to understand the content of images. (b) Directly interpreting and generating detailed descriptions of degraded images leads to incorrect understanding and image restoration. (c) RealSR-R1 simulates the human image restoration process by progressively refining the understanding of image content and generating higher-quality images. (d) Incorrect semantics lead the model to reconstruct erroneous content, whereas correct semantic recovery of fine-grained details, such as plant textures, is crucial.
  • Figure 2: Illustration of the proposed RealSR-R1. The multi-step output in the center of the image represents the VLCoT process. Four specially designed reward functions are displayed at the bottom, including format reward, degradation reward, understanding reward, and generation reward.
  • Figure 3: Visual example of step-by-step generation of detailed image descriptions.
  • Figure 4: Illustrative example of a vision expert model assigning scores to a set of images.
  • Figure 5: Qualitative comparisons with different SOTA methods.
  • ...and 3 more figures