How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices

Xiang Yin, Jinfan Hu, Zhiyuan You, Kainan Yan, Yu Tang, Chao Dong, Jinjin Gu

TL;DR

A large-scale study grounded in a new multi-dimensional evaluation pipeline that assesses models on detail, sharpness, semantic correctness, and overall quality, and uncovers a key evolution in failure modes that signifies a paradigm shift for the perception-oriented low-level vision field.

Abstract

Generative Image Restoration (GIR) has achieved impressive perceptual realism, but how far have its practical capabilities truly advanced compared with previous methods? To answer this, we present a large-scale study grounded in a new multi-dimensional evaluation pipeline that assesses models on detail, sharpness, semantic correctness, and overall quality. Our analysis covers diverse architectures, including diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models, revealing critical performance disparities. Furthermore, our analysis uncovers a key evolution in failure modes that signifies a paradigm shift for the perception-oriented low-level vision field. The central challenge is evolving from the previous problem of detail scarcity (under-generation) to the new frontier of detail quality and semantic control (preventing over-generation). We also leverage our benchmark to train a new IQA model that better aligns with human perceptual judgments. Ultimately, this work provides a systematic study of modern generative image restoration models, offering crucial insights that redefine our understanding of their true state and chart a course for future development.

Paper Structure

This paper contains 26 sections, 40 figures, and 10 tables.

Figures (40)

  • Figure 1: Overview of the proposed dataset composition across Semantic and Degradation dimensions. The semantic categories (left) cover diverse visual contents to reveal the perceptual and structural challenges faced by generative restoration models. The degradation categories (right) reflect real-world conditions where restoration models are typically applied.
  • Figure 2: Distribution of annotated total scores across semantic scene groups. The horizontal width of each box indicates the percentage of samples within each score interval. The light gray region indicates low overall scores, representing generally unacceptable results.
  • Figure 3: Detail and sharpness score distributions across semantic scenes. The width of each box corresponds to the percentage of samples. The red line indicates the balance point; scores above it (Over $\uparrow$) denote over-generation, and scores below it (Less $\downarrow$) denote under-generation.
  • Figure 4: Illustration of our annotation criteria. The rows (top to bottom) indicate: Detail (under-generated [-2] to over-generated [+2]), Sharpness (blurred [-2] to over-sharpened [+2]), and Semantic correctness (severe failure [0] to fully consistent [4]).
  • Figure 5: Distribution of semantic consistency scores across scenes. Box widths represent the sample percentage at each score. The light gray region marks unacceptable results.
  • ...and 35 more figures
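
The annotation scales described in the Figure 3 and Figure 4 captions can be sketched as a small data structure. This is only an illustrative sketch, not the authors' tooling: the names `Annotation` and `generation_side` are hypothetical, and treating 0 as the balance point of the detail and sharpness scales is an assumption inferred from the symmetric [-2, +2] range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One annotated restoration result (hypothetical container)."""

    detail: int     # -2 (under-generated) .. +2 (over-generated)
    sharpness: int  # -2 (blurred) .. +2 (over-sharpened)
    semantic: int   # 0 (severe failure) .. 4 (fully consistent)

    def __post_init__(self) -> None:
        # Reject scores outside the ranges given in the Figure 4 caption.
        if not -2 <= self.detail <= 2:
            raise ValueError("detail must be in [-2, 2]")
        if not -2 <= self.sharpness <= 2:
            raise ValueError("sharpness must be in [-2, 2]")
        if not 0 <= self.semantic <= 4:
            raise ValueError("semantic must be in [0, 4]")


def generation_side(score: int) -> str:
    """Classify a detail or sharpness score relative to the balance point.

    Assumption: the balance point is 0, the midpoint of the [-2, +2] scale.
    """
    if score > 0:
        return "over"
    if score < 0:
        return "under"
    return "balanced"
```

For example, `generation_side(Annotation(detail=2, sharpness=-1, semantic=3).detail)` classifies the detail score as over-generation, while the sharpness score of the same record falls on the under-generation side.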