Cross-Dataset Generalization For Retinal Lesions Segmentation

Clément Playout, Farida Cheriet

TL;DR

This work studies cross-dataset generalization for retinal lesion segmentation by characterizing multiple public fundus datasets and assessing how annotation style (coarse, fine, or mixed) affects generalization. It trains a UNet-based segmentation model and systematically evaluates 31 dataset-combination configurations, finding that adding coarsely labeled data can improve performance on finely labeled test sets, whereas training on coarse data alone can hurt accuracy. The study also evaluates three generalization techniques (ensembles, Stochastic Weight Averaging (SWA), and model soups): ensembles help most consistently but are computationally expensive, while SWA and model soups offer limited, inconsistent gains for segmentation. Overall, the work highlights practical strategies for leveraging heterogeneous datasets to improve retinal lesion segmentation, and points to noisy-label learning and domain adaptation as avenues for future improvement.
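One way to arrive at 31 configurations is to enumerate every non-empty subset of five datasets, since 2^5 - 1 = 31. The sketch below illustrates that count under the assumption that the pool consists of the five datasets abbreviated in the figure captions (IDRID, MES, DDR, FGA, RET); the helper name `training_combinations` is hypothetical, and the paper's actual enumeration may differ.

```python
from itertools import combinations

# Dataset abbreviations as they appear in the figure captions. Treating these
# five as the complete pool is an assumption made for this sketch.
DATASETS = ["IDRID", "MES", "DDR", "FGA", "RET"]


def training_combinations(datasets):
    """Return every non-empty subset of `datasets`, smallest subsets first."""
    return [
        combo
        for size in range(1, len(datasets) + 1)
        for combo in combinations(datasets, size)
    ]


configs = training_combinations(DATASETS)
print(len(configs))  # number of non-empty subsets of five items
```

Each tuple in `configs` names one candidate training set, from single datasets up to the union of all five.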

Abstract

Identifying lesions in fundus images is an important milestone toward an automated and interpretable diagnosis of retinal diseases. To support research in this direction, multiple datasets have been released, each proposing ground-truth maps for different lesions. However, important discrepancies exist between the annotations, which raises the question of generalization across datasets. This study characterizes several known datasets and compares different techniques that have been proposed to enhance the generalization performance of a model, such as stochastic weight averaging, model soups and ensembles. Our results provide insights into how to combine coarsely labelled data with a fine-grained dataset in order to improve lesion segmentation.

Paper Structure (14 sections, 4 figures, 3 tables)

This paper contains 14 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Image quality as evaluated by the MCF-Net for each dataset. The distributions are normalized.
  • Figure 2: Joint distribution of lesion size (x-axis) and count (y-axis) per image for each dataset. For visualization purposes, both axes are in log-scale. The lower-right corner corresponds to coarser labels and the upper-left corner to finer ones. The overlapping distributions reveal the existence of three clusters in labelling styles: coarse, fine, and mixed.
  • Figure 3: Results of "leave one out" scenarios for different dataset combinations. The x-axis shows the training dataset combinations, ordered by total number of images. In each case, the background colour indicates the overall labelling style (green = fine-grained, yellow = mixed, red = coarse). The shaded areas show the spread over the 8 models trained in each case with different seeds. (a) Dice on IDRID: the best score (star) is obtained when training on a combination of datasets from overlapping fine-grained clusters (MES, DDR, FGA). (b) Dice on MES: the best score (star) is also obtained when training on fine-grained datasets, while the worst score (dot) is obtained when training on a coarse dataset (RET).
  • Figure 4: Average Dice score for ensemble, SWA and model soup, for the 31 different training sets.
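The quantities named in these captions can be illustrated with a small pure-Python sketch. It shows the Dice score on flat binary masks and the difference between an ensemble (average the predictions of several models) and a uniform model soup (average the weights, then predict once). SWA applies the same weight averaging to checkpoints taken along a single training run. The toy sigmoid-of-a-linear-map `predict` function and all names here are assumptions for illustration, standing in for the paper's UNet-based model rather than reproducing it.

```python
import math


def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flat binary masks (lists of 0/1)."""
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def predict(weights, features):
    """Toy classifier: sigmoid of a dot product (stand-in for a real network)."""
    z = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def ensemble_predict(models, features):
    """Ensemble: one forward pass per model, then average the probabilities."""
    return sum(predict(w, features) for w in models) / len(models)


def soup_predict(models, features):
    """Uniform soup: average the weights once, then a single forward pass."""
    averaged = [sum(ws) / len(models) for ws in zip(*models)]
    return predict(averaged, features)


# Two hypothetical weight vectors, e.g. from runs with different seeds.
models = [[2.0, 0.0], [0.0, -4.0]]
features = [1.0, 1.0]

print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
print(ensemble_predict(models, features))
print(soup_predict(models, features))
```

The ensemble needs one forward pass per member at inference time, which is the source of its extra cost, whereas the soup and SWA produce a single set of weights. Because the model is nonlinear, the two averages generally give different outputs.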