Benchmarking Retinal Blood Vessel Segmentation Models for Cross-Dataset and Cross-Disease Generalization

Jeremiah Fadugba, Patrick Köhler, Lisa Koch, Petru Manescu, Philipp Berens

TL;DR

This work benchmarks retinal vessel segmentation across multiple architectures and loss functions using the large FIVES fundus dataset to study in-domain performance, cross-dataset generalization, and disease-related robustness. It finds that a standard U-Net rivals more complex variants (e.g., FR-UNet, MA-Net) when trained on high-quality data, and that image quality is the primary determinant of segmentation quality; cross-dataset generalization improves when models are trained on large, high-quality data such as FIVES. The results provide practical guidance: prioritize dataset quality and size for cross-domain deployment, and treat architectural complexity as a secondary factor given sufficient data. These findings have direct implications for clinical deployment and dataset curation. Foundation-model approaches may further enhance generalization, but current vessel segmentation benefits most from high-quality training data and careful evaluation across downstream clinical tasks.

Abstract

Retinal blood vessel segmentation can extract clinically relevant information from fundus images. As manual tracing is cumbersome, algorithms based on convolutional neural networks have been developed. Such studies have used small publicly available datasets for training and measuring performance, running the risk of overfitting. Here, we provide a rigorous benchmark for various architectural and training choices commonly used in the literature on the largest dataset published to date. We train and evaluate five published models on the publicly available FIVES fundus image dataset, which exceeds previous ones in size and quality and which also contains images from common ophthalmological conditions (diabetic retinopathy, age-related macular degeneration, glaucoma). We compare the performance of different model architectures across different loss functions, levels of image quality, and ophthalmological conditions, and assess their ability to perform well in the face of disease-induced domain shifts. Given sufficient training data, basic architectures such as U-Net perform just as well as more advanced ones, and transfer across disease-induced domain shifts typically works well for most architectures. However, we find that image quality is a key factor determining segmentation outcomes. When optimizing for segmentation performance, investing in a well-curated dataset to train a standard architecture yields better results than tuning a sophisticated architecture on a smaller dataset or one with lower image quality. We distilled the utility of architectural advances in terms of their clinical relevance, thereby providing practical guidance for model choices depending on the circumstances of the clinical setting.

Paper Structure

This paper contains 18 sections, 5 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Example images and manual segmentations for FIVES (including an example from each subgroup), DRIVE and CHASEDB1.
  • Figure 2: Cross-dataset generalization in terms of Dice for each model. Each marker represents a tuple of the source (training) and the target (testing) dataset. The former is indicated by the marker type and the latter by a subscript letter. Hence the blue square with subscript F shows the mean Dice over the FIVES dataset for a model that was trained on CHASEDB1. The vertical distance to the diagonal quantifies the domain gap.
  • Figure 3: Generalization across diseases. a) The segmentation performance in the 3 vs. 1 scenario, when models were trained on three pathological conditions only and tested on the remaining one. b) The average performance within the subgroup after the regular training procedure. c) Contrasts both settings directly, where each pair of points corresponds to entries in the heatmaps in a) and b).
  • Figure 4: Impact of image quality on segmentation performance. a) Example images with varying quality. Column 1: high quality in all aspects. Column 2: low Illumination and Color (first image), low quality in all three categories (second image). b) Overall image quality. c) Image quality split into its three aspects: Illumination and Color, Blur, and Contrast.
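
The Figure 2 caption reports mean Dice per source/target pair and reads the domain gap off the distance to the diagonal. The following is a minimal sketch of how such numbers could be computed, not the authors' evaluation code. The function names (`dice_score`, `mean_dice`, `domain_gap`), the convention that two empty masks score 1.0, and the definition of the gap as in-domain mean Dice minus cross-dataset mean Dice are all assumptions made for illustration.

```python
from typing import Iterable, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # 2D binary mask, 1 = vessel pixel


def dice_score(pred: Mask, target: Mask) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|) for two binary masks of equal shape."""
    intersection = pred_sum = target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p, t = bool(p), bool(t)
            intersection += p and t
            pred_sum += p
            target_sum += t
    denom = pred_sum + target_sum
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if denom == 0 else 2.0 * intersection / denom


def mean_dice(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Average Dice over (prediction, ground truth) pairs of one test set."""
    scores = [dice_score(pred, target) for pred, target in pairs]
    if not scores:
        raise ValueError("mean_dice needs at least one mask pair")
    return sum(scores) / len(scores)


def domain_gap(in_domain_dice: float, cross_dataset_dice: float) -> float:
    """Assumed definition: drop in mean Dice when the test set changes."""
    return in_domain_dice - cross_dataset_dice


if __name__ == "__main__":
    # Toy 2x2 masks, purely illustrative.
    prediction = [[1, 1], [0, 0]]
    ground_truth = [[1, 0], [0, 0]]
    print(f"Dice: {dice_score(prediction, ground_truth):.3f}")
```

Under this assumed definition, a positive gap means the model scores lower on the unseen target dataset than on its own source dataset, and a gap near zero corresponds to a marker lying close to the diagonal.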