Rethinking Image Evaluation in Super-Resolution

Shaolin Su, Josep M. Rocafort, Danna Xue, David Serrano-Lozano, Lei Sun, Javier Vazquez-Corral

TL;DR

This work identifies imperfect ground-truth (GT) images as a key source of bias in SR evaluations, showing that model rankings and metric conclusions can be distorted when GT quality is poor. It introduces the Relative Quality Index (RQI), an order-sensitive, relative-quality metric trained with pairwise discrepancies to compare images without assuming a perfect reference. Across extensive user studies and multiple SR benchmarks, RQI demonstrates stronger alignment with human perception than traditional FR-IQA and NR-IQA metrics, and it offers robust fairness under GT imperfections. The findings underscore the need for high-quality GTs in SR datasets and provide a practical framework to evaluate SR methods more fairly, with potential applicability to other low-level vision tasks.

Abstract

While recent image super-resolution (SR) techniques continually improve the perceptual quality of their outputs, they often fail in quantitative evaluations. This inconsistency has led to growing distrust in existing image metrics for SR evaluation. Although image evaluation depends on both the metric and the reference ground truth (GT), researchers typically do not examine the role of GTs, as they are generally accepted as 'perfect' references. However, because the data were collected in early years and other types of distortion were not controlled for, we point out that GTs in existing SR datasets can exhibit relatively poor quality, which leads to biased evaluations. Following this observation, in this paper we ask the following questions: Are GT images in existing SR datasets 100% trustworthy for model evaluation? How does GT quality affect this evaluation? And how can fair evaluations be made when GTs are imperfect? To answer these questions, this paper presents two main contributions. First, by systematically analyzing seven state-of-the-art SR models across three real-world SR datasets, we show that low-quality GTs consistently affect SR performance across models, and that models can perform quite differently when GT quality is controlled. Second, we propose a novel perceptual quality metric, the Relative Quality Index (RQI), that measures the relative quality discrepancy of image pairs, thus addressing the biased evaluations caused by unreliable GTs. Our proposed model achieves significantly better consistency with human opinions. We expect our work to provide insights for the SR community on how future datasets, models, and metrics should be developed.
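
The GT-controlled analysis described in the first contribution can be pictured with a minimal sketch. It assumes each test sample carries a GT quality score in [0, 100] and one metric value per model. The field names (`gt_quality`, `metric`), the model names, and all numbers are hypothetical toy values, not data from the paper, and the thresholds are illustrative only.

```python
from statistics import mean

def scores_after_discarding(samples, thresholds):
    """For each threshold, average every model's metric over the samples
    whose GT quality score is at least that threshold."""
    results = {}
    for t in thresholds:
        kept = [s for s in samples if s["gt_quality"] >= t]
        if not kept:
            results[t] = {}  # nothing survives this threshold
            continue
        models = kept[0]["metric"].keys()
        results[t] = {m: mean(s["metric"][m] for s in kept) for m in models}
    return results

# Toy data (hypothetical): higher metric value = better, GT quality in [0, 100].
samples = [
    {"gt_quality": 30.0, "metric": {"model_a": 0.10, "model_b": 0.60}},
    {"gt_quality": 60.0, "metric": {"model_a": 0.50, "model_b": 0.30}},
    {"gt_quality": 90.0, "metric": {"model_a": 0.70, "model_b": 0.50}},
]

curves = scores_after_discarding(samples, thresholds=[0, 50, 80])
for t, per_model in curves.items():
    print(t, per_model)
```

In this toy setup, `model_b` scores higher on average when every sample is kept, while `model_a` scores higher once the sample with the lowest GT quality is discarded, which is the kind of ranking change the abstract refers to.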

Paper Structure

This paper contains 21 sections, 2 equations, 20 figures, 4 tables.

Figures (20)

  • Figure 1: We show that even Ground Truth (GT) images in existing SR datasets (RealSR, DRealSR) can show relatively poor quality. As a result, image metrics tend to favor outputs that more closely resemble the reference GTs (middle), even when they are perceptually poorer (left side), leading to evaluations that contradict human preferences (right side). Please zoom in for better comparisons.
  • Figure 2: GT quality and degradation distributions in three SR datasets. All scores range from 0 to 100. (a): a higher quality score indicates better quality; (b)-(d): higher degradation scores indicate larger distortions.
  • Figure 3: GT samples from the RealSR, DRealSR, and DIV2K datasets. The images suffer from blur, vague details, and noise, respectively. Please zoom in for a better view.
  • Figure 4: We show how evaluations of 7 SR models change when low-quality GTs are gradually discarded from the testing datasets.
  • Figure 5: The proposed RQI scheme differs from the traditional FR-IQA scheme in three aspects: 1. RQI is order-sensitive. 2. We replace the reference image $I_0$ with any image $I_i$ in the distorted image sequence. 3. The relative quality discrepancy is used as the label.
  • ...and 15 more figures
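
The three aspects listed in the Figure 5 caption can be illustrated with a small sketch of how order-sensitive pair labels might be built. This is an assumption-laden illustration, not the paper's implementation: the per-image quality scores, the function name `relative_pairs`, and the sign convention (target minus reference) are all hypothetical.

```python
from itertools import permutations

def relative_pairs(quality):
    """Build labels for every ordered (reference, target) pair.

    Any image in the sequence may act as the reference, and the label is the
    quality discrepancy quality[target] - quality[reference] (assumed sign
    convention), so swapping the pair order flips the sign of the label.
    """
    indices = range(len(quality))
    return {(i, j): quality[j] - quality[i] for i, j in permutations(indices, 2)}

# Hypothetical quality scores for a sequence I_0, I_1, I_2 (higher = better).
quality = [80.0, 65.0, 40.0]
pairs = relative_pairs(quality)

for (ref, tgt), label in sorted(pairs.items()):
    print(f"reference I_{ref}, target I_{tgt}: label {label:+.1f}")
```

Unlike a fixed-reference FR-IQA setup, where only pairs of the form ($I_0$, $I_i$) would appear, every ordered pair contributes a label here, and a positive label means the target is judged better than its reference.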