Beyond Pixels: Text Enhances Generalization in Real-World Image Restoration

Haoze Sun, Wenbo Li, Jiayue Liu, Kaiwen Zhou, Yongqiang Chen, Yong Guo, Yanwei Li, Renjing Pei, Long Peng, Yujiu Yang

TL;DR

This work tackles the generalization gap in real-world, diffusion-based image restoration by introducing text as an auxiliary invariant representation to reactivate generative priors on out-of-distribution data. The authors identify text richness and relevance as key factors and implement Res-Captioner, a restoration-specific captioner with Chain-of-Thought captioning and a degradation-aware encoder, to adapt text inputs to content and degradation levels. They pair this with RealIR, a broad real-world benchmark, and show through extensive experiments that Res-Captioner consistently improves diffusion-based restoration models in both quantitative metrics and perceptual quality, across multiple backbones and degradation severities. The approach is plug-and-play and supported by a public benchmark, offering a practical path to robust real-world restoration with enhanced texture fidelity and stability across devices and conditions.

Abstract

Generalization has long been a central challenge in real-world image restoration. While recent diffusion-based restoration methods, which leverage generative priors from text-to-image models, have made progress in recovering more realistic details, they still encounter "generative capability deactivation" when applied to out-of-distribution real-world data. To address this, we propose using text as an auxiliary invariant representation to reactivate the generative capabilities of these models. We begin by identifying two key properties of text input: richness and relevance, and examine their respective influence on model performance. Building on these insights, we introduce Res-Captioner, a module that generates enhanced textual descriptions tailored to image content and degradation levels, effectively mitigating response failures. Additionally, we present RealIR, a new benchmark designed to capture diverse real-world scenarios. Extensive experiments demonstrate that Res-Captioner significantly enhances the generalization abilities of diffusion-based restoration models, while remaining fully plug-and-play.

Paper Structure

This paper contains 31 sections, 14 figures, 9 tables.

Figures (14)

  • Figure 1: State-of-the-art methods such as SUPIR are limited in utilizing their full generative capacity, often yielding blurred or otherwise unsatisfactory results on out-of-distribution (OOD) data, a phenomenon we term "generative capability deactivation". Our Res-Captioner can reactivate their generative capabilities by providing detailed and accurate descriptions.
  • Figure 2: Visualization of the text richness property. (Left) The richness of textures and details in the restored results increases with text richness. Text that is too short can result in the "generative capability deactivation" problem. Excessively long text can lead to messy generation and artifacts. (Right) We can classify image content into three categories based on the effect of increased text richness: (I) beneficial, (II) insensitive, and (III) detrimental.
  • Figure 3: Visualization and demonstration of the text relevance property. Left: The accuracy of textures and details in the restored results decreases as the text-replacing ratio increases, indicating that text relevance contributes to the fidelity of the restoration. Right: DISTS increases with a higher text-replacing ratio, further indicating a decrease in the fidelity of the restored results.
  • Figure 4: Demonstration of the richness property. (a, b): There is a positive correlation between text richness and the richness of textures in the restored results. (c, d): The optimal text richness (indicated by an asterisk) is proportional to the degree of deviation between the test degradation domain and the training degradation domain. Best viewed zoomed in.
  • Figure 4: Ablation studies on text richness, relevance, and harmful descriptions. We highlight best values for each metric.
  • ...and 9 more figures