Dense Self-Supervised Learning for Medical Image Segmentation

Maxime Seince; Loic Le Folgoc; Luiz Augusto Facury de Souza; Elsa Angelini

TL;DR

This paper tackles the annotation bottleneck in medical image segmentation with Pix2Rep, a dense self-supervised learning framework that learns pixel-level representations from unlabeled images. It pretrains encoder-decoder backbones (e.g., U-Net) so that pixel representations are equivariant under geometric transforms and invariant to intensity augmentations, which enables effective few-shot segmentation. Two variants are presented: a contrastive one (Pix2Rep) and a non-contrastive one (Pix2Rep-v2). The pretrained backbone is then used for downstream segmentation via linear probing or fine-tuning. On cardiac MRI, Pix2Rep improves Dice across data regimes (by 30% under linear probing and 31% under fine-tuning in the one-shot setting) and reaches the performance of a fully supervised U-Net baseline with 5 times fewer annotations. Combining it with Mean Teacher semi-supervised fine-tuning yields further gains.

Abstract

Deep learning has revolutionized medical image segmentation, but it relies heavily on high-quality annotations. The time, cost and expertise required to label images at the pixel level for each new task have slowed down widespread adoption of the paradigm. We propose Pix2Rep, a self-supervised learning (SSL) approach for few-shot segmentation that reduces the manual annotation burden by learning powerful pixel-level representations directly from unlabeled images. Pix2Rep is a novel pixel-level loss and pre-training paradigm for contrastive SSL on whole images. It is applied to generic encoder-decoder deep learning backbones (e.g., U-Net). Whereas most SSL methods enforce invariance of the learned image-level representations under intensity and spatial image augmentations, Pix2Rep enforces equivariance of the pixel-level representations. We demonstrate the framework on a task of cardiac MRI segmentation. Results show improved performance compared to existing semi- and self-supervised approaches, and a 5-fold reduction in the annotation burden for equivalent performance versus a fully supervised U-Net baseline. This includes a 30% (resp. 31%) Dice improvement for one-shot segmentation under linear probing (resp. fine-tuning). Finally, we also integrate the novel Pix2Rep concept with the Barlow Twins non-contrastive SSL, which leads to even better segmentation performance.
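
To make the equivariance idea concrete, the following is a minimal sketch of a pixel-level contrastive objective, not the authors' implementation. It assumes a generic InfoNCE loss with cosine similarity, a horizontal flip as the spatial transformation, and hand-written toy embeddings in place of a real encoder-decoder; the names `hflip`, `pixel_info_nce` and the temperature value are hypothetical. The two embedding maps stand for the outputs of two pretraining branches after both have been brought into the same spatial frame.

```python
import math

def hflip(grid):
    """Toy spatial transform: horizontal flip of an H x W grid of cells."""
    return [list(reversed(row)) for row in grid]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def pixel_info_nce(z1, z2, temperature=0.1):
    """Mean InfoNCE loss over pixels (generic form, assumed for illustration).

    z1, z2: H x W grids of embedding vectors in the same spatial frame.
    Pixel p of z1 is the positive for pixel p of z2; every other pixel
    of z2 acts as a negative.
    """
    a = [v for row in z1 for v in row]
    b = [v for row in z2 for v in row]
    total = 0.0
    for i, u in enumerate(a):
        logits = [cosine(u, v) / temperature for v in b]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)

# Toy 2 x 2 map of 2-D pixel embeddings, standing in for a network output.
z = [[(1.0, 0.0), (0.0, 1.0)],
     [(-1.0, 0.0), (0.0, -1.0)]]

# Equivariant case: the embeddings of the flipped image equal the flipped
# embeddings of the original image, so the two branches agree pixel by pixel.
loss_aligned = pixel_info_nce(hflip(z), hflip(z))

# Non-equivariant case: one branch ignores the flip, so pixels are mismatched.
loss_misaligned = pixel_info_nce(hflip(z), z)

print(loss_aligned, loss_misaligned)
```

In this toy setting the aligned loss is close to zero while the misaligned loss is large, which is the behavior a pixel-level equivariance objective is meant to reward. In practice the spatial transformation would be a richer random warp applied to feature maps, and negatives are often subsampled for memory reasons; both are outside the scope of this sketch.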

Paper Structure (8 sections, 2 equations, 4 figures, 8 tables)

Introduction
Related Work
Methods
Experiments
Discussion & Conclusion
Additional Results
Dice Scores per Anatomical Structures
Visualization of the Pix2Rep Pixel Embeddings

Figures (4)

Figure 1: Pretraining of arbitrary encoder-decoder architectures $f$ (e.g., U-Net). $\mathbf{x}$ an unlabeled training image; $\phi\sim \mathcal{T}_s$ a random spatial transformation; $t,t'\sim\mathcal{T}_i$ two random intensity transformations; $g$ a projection head. We train pixel representation maps output by $f$ to be equivariant under $\phi$ and invariant to $t,t'$ by maximizing agreement between the outputs of the two branches, via a pixel-level contrastive loss.
Figure 2: Pixel embedding similarity maps. Large images: query images in which we select a query pixel (highlighted in red). For each query, we display two test images, with the pixel closest (in embedding space) to the query pixel highlighted in red. Similarity maps (cosine similarity between pixel embeddings) are also shown.
Figure 3: Proposed pretraining vs. fully-supervised baseline (same U-Net architecture).
Figure 4: Pix2Rep pixel-level embeddings. First and Second columns: test cardiac MRI images and ground truth segmentations. Third column: 2D t-SNE coordinates of Pix2Rep pixel embeddings. Fourth column: colored pixel embedding displayed in original MRI image space. The reference colormap used to map 2D t-SNE coordinates with individual colors is shown in the vignette on the top row example.
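
The similarity maps described in the Figure 2 caption can be sketched as follows. This is an illustrative example only: the embeddings are hand-written toy values rather than Pix2Rep outputs, and the helper names `similarity_map` and `closest_pixel` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def similarity_map(query_embedding, embedding_map):
    """Cosine similarity between one query embedding and every pixel embedding."""
    return [[cosine_similarity(query_embedding, v) for v in row]
            for row in embedding_map]

def closest_pixel(sim_map):
    """(row, col) of the pixel with the highest similarity to the query."""
    positions = [(r, c) for r, row in enumerate(sim_map) for c in range(len(row))]
    return max(positions, key=lambda rc: sim_map[rc[0]][rc[1]])

# Toy 2 x 2 test image with 2-D pixel embeddings (values made up for the demo).
test_embeddings = [[(1.0, 0.0), (0.6, 0.8)],
                   [(0.0, 1.0), (-1.0, 0.0)]]

# Embedding of the query pixel selected in the query image.
query = (0.0, 2.0)

sims = similarity_map(query, test_embeddings)
closest = closest_pixel(sims)
print(sims, closest)
```

Cosine similarity ignores vector magnitude, so the query `(0.0, 2.0)` matches the test pixel with embedding `(0.0, 1.0)` exactly, and the highlighted pixel in such a visualization would be that one.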
