Dense Self-Supervised Learning for Medical Image Segmentation
Maxime Seince, Loic Le Folgoc, Luiz Augusto Facury de Souza, Elsa Angelini
TL;DR
This paper tackles the annotation bottleneck in medical image segmentation by introducing Pix2Rep, a dense self-supervised learning framework that learns pixel-level representations from unlabeled images. It pretrains encoder-decoder backbones (e.g., U-Net) with a pixel-level objective that enforces equivariance under geometric transforms and invariance under intensity augmentations, enabling effective few-shot segmentation. A contrastive variant (Pix2Rep) and a non-contrastive variant (Pix2Rep-v2) are presented; the pretrained backbone is then used for downstream segmentation via linear probing or fine-tuning. On cardiac MRI, Pix2Rep improves one-shot Dice by 30% under linear probing and 31% under fine-tuning, and reduces the annotation burden 5-fold for equivalent performance versus a fully supervised U-Net baseline. Combining it with Mean Teacher semi-supervised fine-tuning yields further gains.
Abstract
Deep learning has revolutionized medical image segmentation, but it relies heavily on high-quality annotations. The time, cost and expertise required to label images at the pixel level for each new task have slowed down widespread adoption of the paradigm. We propose Pix2Rep, a self-supervised learning (SSL) approach for few-shot segmentation that reduces the manual annotation burden by learning powerful pixel-level representations directly from unlabeled images. Pix2Rep is a novel pixel-level loss and pre-training paradigm for contrastive SSL on whole images. It is applied to generic encoder-decoder deep learning backbones (e.g., U-Net). Whereas most SSL methods enforce invariance of the learned image-level representations under intensity and spatial image augmentations, Pix2Rep enforces equivariance of the pixel-level representations. We demonstrate the framework on a task of cardiac MRI segmentation. Results show improved performance compared to existing semi- and self-supervised approaches, and a 5-fold reduction in the annotation burden for equivalent performance versus a fully supervised U-Net baseline. This includes a 30% (resp. 31%) Dice improvement for one-shot segmentation under linear probing (resp. fine-tuning). Finally, we also integrate the novel Pix2Rep concept with the non-contrastive Barlow Twins SSL method, which leads to even better segmentation performance.
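
The contrast between image-level invariance and pixel-level equivariance can be made concrete with a small sketch. The code below is not the authors' implementation: it is a toy, pure-Python illustration that assumes a horizontal flip as the geometric transform T, hand-made one-hot feature maps in place of U-Net outputs, and a generic InfoNCE-style loss in which the positive pair is the same pixel location after alignment and the negatives are all other pixels. All function names (`hflip`, `pixel_info_nce`, `equivariance_loss`) and the temperature value are hypothetical choices made for illustration.

```python
import math

def hflip(grid):
    """Horizontal flip of a 2D grid (list of rows); stands in for a geometric transform T."""
    return [row[::-1] for row in grid]

def flatten(grid):
    """Turn an H x W grid of feature vectors into a flat list of H*W vectors."""
    return [cell for row in grid for cell in row]

def normalize(v):
    """L2-normalize a feature vector (left unchanged if its norm is zero)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def pixel_info_nce(feats_a, feats_b, temperature=0.1):
    """InfoNCE over pixels: index k in feats_a should match index k in feats_b."""
    a = [normalize(v) for v in feats_a]
    b = [normalize(v) for v in feats_b]
    total = 0.0
    for k, za in enumerate(a):
        logits = [sum(x * y for x, y in zip(za, zb)) / temperature for zb in b]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[k]
    return total / len(a)

def equivariance_loss(feat_x, feat_tx, transform, temperature=0.1):
    """Compare T(f(x)) with f(T(x)) pixel by pixel; low when f is equivariant to T."""
    warped = transform(feat_x)
    return pixel_info_nce(flatten(warped), flatten(feat_tx), temperature)

def one_hot(k, n):
    return [1.0 if i == k else 0.0 for i in range(n)]

# Toy 2 x 3 feature map f(x): every pixel gets its own one-hot feature.
feat_x = [[one_hot(r * 3 + c, 6) for c in range(3)] for r in range(2)]

# An equivariant encoder would output flipped features for the flipped image.
loss_equivariant = equivariance_loss(feat_x, hflip(feat_x), hflip)

# An encoder that ignores the flip outputs the same feature map for T(x).
loss_not_equivariant = equivariance_loss(feat_x, feat_x, hflip)

print(loss_equivariant, loss_not_equivariant)
```

In this toy setting the loss is close to zero when the features move with the image and much larger when they do not, which is the behavior a pixel-level equivariance objective is meant to reward. Intensity augmentations would leave the alignment step untouched, since they do not move pixels.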
