Table of Contents
Fetching ...

Self-supervised Learning of Dense Hierarchical Representations for Medical Image Segmentation

Eytan Kats, Jochen G. Hirsch, Mattias P. Heinrich

TL;DR

The paper tackles the need for data-efficient dense segmentation in medical imaging by learning voxel-wise, coarse-to-fine representations with a self-supervised framework. It mitigates the inherent bias of feature pyramids toward global context through local augmentations, a hierarchically balanced architecture, and a hybrid contrastive-restorative loss. Across MRI and CT tasks, the approach improves linear evaluation and enables strong fine-tuning with limited labels, outperforming the voxel2vec baseline and even models pretrained on more data. These results suggest a practical path toward scalable, annotation-efficient medical image segmentation and provide pretrained models for public use.

Abstract

This paper demonstrates a self-supervised framework for learning voxel-wise coarse-to-fine representations tailored for dense downstream tasks. Our approach stems from the observation that existing methods for hierarchical representation learning tend to prioritize global features over local features due to inherent architectural bias. To address this challenge, we devise a training strategy that balances the contributions of features from multiple scales, ensuring that the learned representations capture both coarse and fine-grained details. Our strategy incorporates 3-fold improvements: (1) local data augmentations, (2) a hierarchically balanced architecture, and (3) a hybrid contrastive-restorative loss function. We evaluate our method on CT and MRI data and demonstrate that our new approach particularly beneficial for fine-tuning with limited annotated data and consistently outperforms the baseline counterpart in linear evaluation settings.

Self-supervised Learning of Dense Hierarchical Representations for Medical Image Segmentation

TL;DR

The paper tackles the need for data-efficient dense segmentation in medical imaging by learning voxel-wise, coarse-to-fine representations with a self-supervised framework. It mitigates the inherent bias of feature pyramids toward global context through local augmentations, a hierarchically balanced architecture, and a hybrid contrastive-restorative loss. Across MRI and CT tasks, the approach improves linear evaluation and enables strong fine-tuning with limited labels, outperforming the voxel2vec baseline and even models pretrained on more data. These results suggest a practical path toward scalable, annotation-efficient medical image segmentation and provide pretrained models for public use.

Abstract

This paper demonstrates a self-supervised framework for learning voxel-wise coarse-to-fine representations tailored for dense downstream tasks. Our approach stems from the observation that existing methods for hierarchical representation learning tend to prioritize global features over local features due to inherent architectural bias. To address this challenge, we devise a training strategy that balances the contributions of features from multiple scales, ensuring that the learned representations capture both coarse and fine-grained details. Our strategy incorporates 3-fold improvements: (1) local data augmentations, (2) a hierarchically balanced architecture, and (3) a hybrid contrastive-restorative loss function. We evaluate our method on CT and MRI data and demonstrate that our new approach particularly beneficial for fine-tuning with limited annotated data and consistently outperforms the baseline counterpart in linear evaluation settings.
Paper Structure (12 sections, 3 equations, 2 figures, 3 tables)

This paper contains 12 sections, 3 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Fine-tuning performance for varying amounts of data. The benefit of pre-training are particularly evident when only limited labeled data is available.
  • Figure 2: Pre-training pipeline. Two overlapping patches augmented and processed by the FPN. The feature maps from the different scale levels projected to the same channel size. Feature vectors sampled from the projected scale levels in corresponding positions to form hierarchically balanced voxel representation vectors. Voxel representations projected to the latent space where the InfoNCE loss is calculated.