Self-supervised learning via inter-modal reconstruction and feature projection networks for label-efficient 3D-to-2D segmentation

José Morano; Guilherme Aresta; Dmitrii Lachinov; Julia Mai; Ursula Schmidt-Erfurth; Hrvoje Bogunović

TL;DR

The paper tackles label-efficient 3D→2D segmentation in optical coherence tomography (OCT) by introducing a full-volume CNN with a 3D encoder and a 2D decoder linked through novel 3D→2D feature projection blocks (FPBs). It pairs this architecture with a self-supervised learning (SSL) pretraining scheme that reconstructs image pairs of modalities with different dimensionality (e.g., OCT to SLO/FAF) to learn representations without labels. On geographic atrophy (GA) and reticular pseudodrusen (RPD) segmentation, the proposed CNN improves on state-of-the-art methods in low-data settings by up to 8% in Dice score without SSL, and SSL pretraining improves this performance further by up to 23%. FAF-based SSL often provides higher gains, while SLO-based SSL has the benefit of not requiring registration. The findings suggest that the SSL paradigm may apply to other 3D→2D tasks and multi-modal medical imaging domains, enabling more data-efficient deployment in clinical workflows.

Abstract

Deep learning has become a valuable tool for the automation of certain medical image segmentation tasks, significantly relieving the workload of medical specialists. Some of these tasks require segmentation to be performed on a subset of the input dimensions, the most common case being 3D-to-2D. However, the performance of existing methods is strongly conditioned by the amount of labeled data available, as there is currently no data-efficient method (e.g., transfer learning) that has been validated on these tasks. In this work, we propose a novel convolutional neural network (CNN) and self-supervised learning (SSL) method for label-efficient 3D-to-2D segmentation. The CNN is composed of a 3D encoder and a 2D decoder connected by novel 3D-to-2D blocks. The SSL method consists of reconstructing image pairs of modalities with different dimensionality. The approach has been validated in two tasks with clinical relevance: the en-face segmentation of geographic atrophy and reticular pseudodrusen in optical coherence tomography. Results on different datasets demonstrate that the proposed CNN significantly improves the state of the art in scenarios with limited labeled data by up to 8% in Dice score. Moreover, the proposed SSL method allows further improvement of this performance by up to 23%, and we show that the SSL is beneficial regardless of the network architecture.
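
Since the reported gains are expressed in Dice score, a minimal sketch of that metric for binary en-face masks may help. The function name `dice_score`, the nested-list mask representation, and the convention of returning 1.0 for two empty masks are assumptions made for illustration, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary 2D masks (nested lists of 0/1)."""
    intersection = 0
    pred_sum = 0
    target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += p * t  # counts pixels set in both masks
            pred_sum += p
            target_sum += t
    if pred_sum + target_sum == 0:
        # Assumed convention: two empty masks are treated as a perfect match.
        return 1.0
    return 2.0 * intersection / (pred_sum + target_sum)


# Toy example: 2 overlapping pixels, 3 predicted pixels, 3 ground-truth pixels.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(dice_score(pred, target))
```

In the toy example the score is 2·2 / (3 + 3) = 2/3.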

Paper Structure (14 sections, 5 figures, 5 tables)

Introduction
Contributions.
Clinical background.
Methods and experimental setup
Network architecture.
Training losses.
Datasets.
Training and evaluation details.
Results and discussion
Baseline comparison.
SSL effect.
Reconstructed modality effect.
Conclusions
Acknowledgements.

Figures (5)

Figure 1: From left to right: OCT slice (B-scan) with the corresponding ground truth annotations overlaid in green, ground truth, SLO with the location of the B-scan indicated in yellow and a zoom-in view in red, and FAF. Top: GA. Bottom: RPD.
Figure 2: Illustration of the proposed approach for 3D$\rightarrow$2D segmentation. A novel 3D$\rightarrow$2D model is trained to reconstruct image pairs of modalities with different dimensionality in an SSL setting, and is then fine-tuned on the target segmentation task.
Figure 3: Proposed 3D$\rightarrow$2D CNN. Each residual encoder block has eight 3D convolutional layers, and each residual decoder block has four 2D layers (the number of feature maps is also shown). The proposed feature projection block (FPB, in red) projects 3D features to the 2D feature space. Each FPB has a variable number of $1 \times 1 \times 3$ convolutions followed by a $1 \times 1 \times 4$ convolution and a depth-wise adaptive average pooling of size 1.
Figure 4: Segmentation results of the models trained with different amounts of data. The title of each plot indicates the test dataset. If a model was pre-trained with SSL, the pre-training modality is shown in parentheses. A table with all means and standard deviations, as well as the results of a Wilcoxon signed-rank test between our proposal and the other methods, is included in the Supplement.
Figure 5: Examples of GA (top) and RPD (bottom) segmentations from different models trained with 5% and 20% of the training data, respectively. True positives are depicted in green; true negatives, in black; false positives, in red; and false negatives, in blue.
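
The feature projection block described in the Figure 3 caption can be illustrated with a heavily simplified, single-channel sketch: convolutions that act only along the depth axis, followed by an average over depth that collapses the volume to a 2D map. Everything below is an assumption made for illustration: the `[height][width][depth]` indexing, the fixed averaging kernels (the real weights are learned), the absence of stride, padding, channels, and non-linearities, and all function names. It is not the authors' implementation.

```python
def conv_along_depth(volume, kernel):
    """Valid 1D cross-correlation along the depth axis of volume[h][w][d].

    Stands in for a 1x1xK convolution, assuming the size-K axis is depth.
    """
    k = len(kernel)
    return [
        [
            [sum(col[i + j] * kernel[j] for j in range(k))
             for i in range(len(col) - k + 1)]
            for col in row
        ]
        for row in volume
    ]


def average_over_depth(volume):
    """Adaptive average pooling to depth size 1: mean of each depth column."""
    return [[sum(col) / len(col) for col in row] for row in volume]


def project_3d_to_2d(volume, kernels):
    """Apply the depth-only convolutions in order, then pool depth away."""
    for kernel in kernels:
        volume = conv_along_depth(volume, kernel)
    return average_over_depth(volume)


# Toy volume: height 1, width 2, depth 8.
volume = [[[1, 1, 1, 1, 1, 1, 1, 1],
           [0, 1, 2, 3, 4, 5, 6, 7]]]
# One size-3 kernel followed by one size-4 kernel (illustrative fixed weights).
kernels = [[0.25, 0.5, 0.25], [0.25, 0.25, 0.25, 0.25]]
feature_map = project_3d_to_2d(volume, kernels)
print(feature_map)
```

Whatever depth remains after the convolutions, the final pooling step always yields one value per (height, width) position, which is the property that lets a 2D decoder consume the encoder's 3D features.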
