EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement

Wenhui Zhu, Xuanzhao Dong, Xin Li, Yujian Xiong, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Zhangsihao Yang, Yi Su, Oana Dumitrascu, Yalin Wang

TL;DR

EyeBench addresses a critical gap in evaluating retinal fundus image enhancement by introducing a multi-dimensional benchmark that jointly considers full-reference and no-reference quality, plus clinically meaningful downstream tasks. It combines distribution-aligned datasets, expert-guided annotations, and multi-task evaluation to assess how well enhancement methods preserve vessels, lesions, and disease-related information. The study finds that multi-dimensional assessments better reflect clinical preferences than single-metric evaluations, and reveals distinct strengths and trade-offs among paired, unpaired, OT-based, and SDE-based methods. Overall, EyeBench provides a practical framework and insights to guide future development toward clinically relevant retinal image enhancement.

Abstract

Over the past decade, generative models have achieved significant success in enhancing fundus images. However, the evaluation of these models still presents a considerable challenge. A comprehensive evaluation benchmark for fundus image enhancement is indispensable for three main reasons: 1) Existing denoising metrics (e.g., PSNR, SSIM) are hard to extend to downstream real-world clinical research (e.g., vessel morphology consistency). 2) There is a lack of comprehensive evaluation for both paired and unpaired enhancement methods, along with a need for expert protocols to accurately assess clinical value. 3) An ideal evaluation system should provide insights to inform future developments of fundus image enhancement. To this end, we propose a novel comprehensive benchmark, EyeBench, to provide insights that align enhancement models with clinical needs, offering a foundation for future work to improve the clinical relevance and applicability of generative models for fundus image enhancement. EyeBench has three appealing properties: 1) Multi-dimensional, clinically aligned downstream evaluation: In addition to evaluating the enhancement task, we provide several clinically significant downstream tasks for fundus images, including vessel segmentation, DR grading, denoising generalization, and lesion segmentation. 2) Medical expert-guided evaluation design: We introduce a novel dataset that promotes comprehensive and fair comparisons between paired and unpaired methods and includes a manual evaluation protocol by medical experts. 3) Valuable insights: Our benchmark study provides a comprehensive and rigorous evaluation of existing methods across different downstream tasks, assisting medical experts in making informed choices. Additionally, we offer further analysis of the challenges faced by existing methods. The code is available at https://github.com/Retinal-Research/EyeBench

Paper Structure

This paper contains 29 sections, 5 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Overview of EyeBench. We introduce EyeBench, a systematic and rigorous benchmark for evaluating retinal image enhancement models. Our evaluation pipeline assesses fundus image enhancement quality from both No-Reference and Full-Reference aspects, enabling a multi-dimensional evaluation. For each aspect, we design a distribution-aligned dataset to ensure fair and clinically meaningful comparisons. Additionally, we include clinically consistent downstream tasks to quantify models' denoising generalization and downstream-task preservation. Our benchmark also incorporates medical expert-guided annotations that adhere to expert protocols, and we statistically validate that EyeBench results align well with clinical preference assessments. Finally, we highlight current challenges to inform future development. EyeBench thus offers insights from multiple perspectives.
  • Figure 2: (A) highlights attribute distributions (i.e., brightness, contrast, sharpness) and diabetic retinopathy (DR) grades across quality categories (i.e., good, usable, and reject). (B) illustrates histograms for the training (i.e., part A and part B), testing, and validation datasets used in Full-Reference evaluations after resampling, with the workflow of degradation algorithms outlined below. (C) shows histograms for real-world No-Reference experiments after resampling. (D) presents samples to be overprocessed.
  • Figure 3: An illustration of the medical experts' clinical preference evaluation across (a) lesion preservation, (b) background preservation, and (c) structure preservation.
  • Figure 4: Validation of expert clinical preference alignment via Spearman’s correlation coefficient ($r$), which is used to assess the correlation between the expert-protocol preference evaluation and the other EyeBench evaluations. Single-dimension evaluations (e.g., denoising, segmentation) may show weak alignment with clinical preferences, while EyeBench multi-dimensional evaluations (e.g., Full-Reference, No-Reference) show stronger correlation.
  • Figure 5: t-SNE visualizations of the latent representation features extracted from the RET-CLIP and RETFound models. Closer proximity of the distributions indicates improved denoising performance of the unpaired method. This analysis demonstrates the effectiveness of the retrieval-enhanced frameworks in capturing and preserving meaningful feature representations. The Euclidean distance between the distribution centroids is shown below each plot.
  • ...and 6 more figures
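
The captions for Figures 4 and 5 rely on two simple statistics: Spearman's rank correlation between expert preference scores and a benchmark score, and the Euclidean distance between the centroids of two feature distributions. The sketch below illustrates both in plain Python. It is not taken from the EyeBench repository: the function names (`average_ranks`, `spearman`, `centroid_distance`) and all numbers are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's r: the Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def centroid_distance(a: Sequence[Sequence[float]],
                      b: Sequence[Sequence[float]]) -> float:
    """Euclidean distance between the mean vectors of two feature sets."""
    centroid_a = [sum(col) / len(a) for col in zip(*a)]
    centroid_b = [sum(col) / len(b) for col in zip(*b)]
    return math.dist(centroid_a, centroid_b)


# Made-up scores for five hypothetical enhancement methods.
expert_preference = [4.1, 3.2, 2.5, 3.9, 1.8]      # higher = preferred
benchmark_score = [0.81, 0.66, 0.70, 0.79, 0.52]   # higher = better

r = spearman(expert_preference, benchmark_score)   # approximately 0.9 here

# Made-up 2-D embeddings standing in for high-quality vs. enhanced features.
reference_features = [[0.0, 0.0], [2.0, 0.0]]
enhanced_features = [[1.0, 3.0], [1.0, 5.0]]
d = centroid_distance(reference_features, enhanced_features)
```

In this reading, a value of `r` near 1 means a benchmark score orders the methods almost as the experts do, and a smaller `d` means the enhanced images sit closer to the reference distribution in feature space. Whether the centroids in Figure 5 are computed in the original feature space or in the t-SNE projection is not stated in the caption, so the choice of input to `centroid_distance` is left open here.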