
Bridging Restoration and Diagnosis: A Comprehensive Benchmark for Retinal Fundus Enhancement

Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Hao Wang, Xin Li, Yujian Xiong, Jiajun Cheng, Zhipeng Wang, Shao Tang, Oana Dumitrascu, Yalin Wang

Abstract

Over the past decade, generative models have demonstrated success in enhancing fundus images. However, the evaluation of these models remains a challenge. A benchmark for fundus image enhancement is needed for three main reasons: (1) Conventional denoising metrics such as PSNR and SSIM fail to capture clinically relevant features, such as lesion preservation and vessel morphology consistency, limiting their applicability in real-world settings; (2) There is a lack of unified evaluation protocols that address both paired and unpaired enhancement methods, particularly those guided by clinical expertise; and (3) An evaluation framework should provide actionable insights to guide future advancements in clinically aligned enhancement models. To address these gaps, we introduce EyeBench-V2, a benchmark designed to bridge the gap between enhancement model performance and clinical utility. Our work offers three key contributions: (1) Multi-dimensional clinical alignment through downstream evaluations: Beyond standard enhancement metrics, we assess performance across clinically meaningful tasks including vessel segmentation, diabetic retinopathy (DR) grading, generalization to unseen noise patterns, and lesion segmentation. (2) Expert-guided evaluation design: We curate a novel dataset enabling fair comparisons between paired and unpaired enhancement methods, accompanied by a structured manual assessment protocol by medical experts, which evaluates clinically critical aspects such as lesion structure alterations, background color shifts, and the introduction of artificial structures. (3) Actionable insights: Our benchmark provides a rigorous, task-oriented analysis of existing generative models, equipping clinical researchers with the evidence needed to make informed decisions, while also identifying limitations in current methods to inform the design of next-generation enhancement models.
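The abstract argues that PSNR and SSIM, the conventional full-reference denoising metrics, miss clinically relevant structure. For reference, here is a minimal numpy sketch of both quantities; the SSIM shown is a single-window (global) simplification, whereas library implementations such as `skimage.metrics.structural_similarity` average the index over local sliding windows. This is an illustrative sketch, not the benchmark's evaluation code.

```python
import numpy as np

def psnr(ref, img, data_range=1.0):
    """Peak signal-to-noise ratio between a reference and a test image."""
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((data_range ** 2) / mse)

def global_ssim(ref, img, data_range=1.0):
    """Single-window (global) SSIM; standard implementations instead
    average the index over local sliding windows."""
    C1 = (0.01 * data_range) ** 2
    C2 = (0.03 * data_range) ** 2
    x, y = ref.astype(np.float64), img.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) / (
        (mx ** 2 + my ** 2 + C1) * (vx + vy + C2))
```

Both scores are intensity statistics: two enhanced images can share a PSNR while differing sharply in lesion preservation, which is exactly the gap the downstream tasks in this benchmark are meant to expose.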

Paper Structure

This paper contains 30 sections, 5 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Overview of EyeBench-v2. We present EyeBench-v2, a systematic and comprehensive benchmark designed to evaluate retinal image enhancement models. The evaluation pipeline encompasses both Full-Reference and No-Reference assessments, enabling a robust multi-dimensional analysis of enhancement quality. For each evaluation aspect, we construct a distribution-aligned dataset to ensure fair, reproducible, and clinically relevant comparisons. Additionally, we incorporate clinically consistent downstream tasks to assess models' generalization in denoising and their capacity to preserve diagnostically important features. It also includes expert-guided annotations developed in accordance with established clinical protocols. Our statistical analysis demonstrates that EyeBench-v2 scores strongly align with clinical quality preferences. Finally, EyeBench-v2 facilitates a rigorous and systematic evaluation of existing GAN-based and SDE-based approaches, uncovering key limitations and offering actionable insights into promising solutions for advancing retinal image enhancement.
  • Figure 2: t-SNE [van2008visualizing] visualizations of latent features extracted by the RETFound [zhou2023foundation] (A) and Ret-Clip [du2024ret] (B) image encoders in the No-Reference evaluation. Blue points show features of synthetic high-quality images $\hat{\mathbf{y}}_1$, while green points show features of true high-quality images $\mathbf{y}_2$. Closer proximity between the two distributions indicates better denoising performance of the unpaired method. More details are provided in Sec. \ref{subsec:exp}.
  • Figure 3: Validation of expert clinical preference alignment via Spearman's correlation coefficient ($r$) between the medical experts (i.e., the Experts Protocol Evaluation) and the other tasks. Notably, our multi-dimensional evaluations (e.g., Full-Reference, No-Reference) demonstrate stronger correlation.
  • Figure 4: Illustration of the limitations of SDE-based methods. (A) and (B) show t-SNE [van2008visualizing] visualizations of features from synthetic high-quality images $\hat{\mathbf{y}}_1^{t_i}$ (blue points) and true high-quality images $\mathbf{y}_2$ (green points), extracted using the RETFound [zhou2023foundation] and Ret-Clip [du2024ret] image encoders, respectively. Yellow arrows indicate the squeezing effect, while gray arrows denote feature drift over time. $d_i$ and $FID_i$ represent the center distance and corresponding FID score at time step $t_i$. (C) visualizes attention drift in the skip-connection feature maps of the generator [dong2024cunsb] as the time step increases. Red boxes highlight lesion structures that gradually receive less attention. More details are discussed in Sec. \ref{Sec:further-analysis}.
  • Figure 5: Overview of the EyeQ [fu2019evaluation] dataset. (A) highlights attribute distributions (i.e., brightness, contrast, sharpness) and diabetic retinopathy (DR) grades across quality categories (i.e., good, usable, and reject). (B) illustrates histograms for the training (i.e., part A and part B), testing, and validation datasets used in Full-Reference evaluations after resampling, with the workflow of degradation algorithms outlined below. (C) shows histograms for real-world No-Reference experiments after resampling. (D) presents reject-quality samples (e.g., overprocessed images).
  • ...and 5 more figures
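Figures 2 and 4 summarize distribution overlap between synthetic and real high-quality feature clouds via the center distance $d_i$ and the FID score. A minimal numpy sketch of both quantities follows; the feature arrays are hypothetical stand-ins for encoder outputs, and the Fréchet distance uses a diagonal-Gaussian simplification (the standard FID uses full covariances and a matrix square root), so treat this as an illustration rather than the benchmark's metric code.

```python
import numpy as np

def centroid_distance(feats_fake, feats_real):
    """Euclidean distance between feature-cloud centers (the d_i of Fig. 4).
    Each input is an (n_samples, n_dims) array of encoder features."""
    return float(np.linalg.norm(feats_fake.mean(axis=0) - feats_real.mean(axis=0)))

def frechet_distance_diag(feats_fake, feats_real):
    """Frechet distance between two Gaussians fit to the feature clouds,
    simplified to diagonal covariances to keep the sketch dependency-free.
    The standard FID computes Tr(S1 + S2 - 2 (S1 S2)^(1/2)) with full
    covariance matrices."""
    mu1, mu2 = feats_fake.mean(axis=0), feats_real.mean(axis=0)
    v1, v2 = feats_fake.var(axis=0), feats_real.var(axis=0)
    return float(((mu1 - mu2) ** 2).sum() + (v1 + v2 - 2.0 * np.sqrt(v1 * v2)).sum())
```

Both quantities shrink as the synthetic distribution approaches the real one, which is the trend the blue and green t-SNE clouds in Figures 2 and 4 depict qualitatively.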
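Figure 3 validates clinical alignment with Spearman's rank correlation coefficient $r$ between expert-protocol scores and each benchmark task. A tie-free numpy sketch of the coefficient is shown below; production code would use `scipy.stats.spearmanr`, which additionally handles tied scores via average ranks.

```python
import numpy as np

def spearman_r(x, y):
    """Spearman's rank correlation for tie-free score vectors:
    Pearson correlation computed on the ranks of x and y."""
    rx = np.argsort(np.argsort(x)).astype(float)  # ranks 0..n-1
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))
```

A coefficient near 1 means a task ranks enhancement models in the same order as the medical experts, which is the property the Full-Reference and No-Reference evaluations exhibit most strongly in Figure 3.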