An OpenMind for 3D medical vision self-supervised learning

Tassilo Wald, Constantin Ulrich, Jonathan Suprijadi, Sebastian Ziegler, Michal Nohel, Robin Peretzke, Gregor Köhler, Klaus H. Maier-Hein

TL;DR

This work tackles the lack of standardization in 3D medical SSL by introducing OpenMind, the largest public pre-training dataset for 3D brain MRI across 23 modalities, and a standardized OpenMind Benchmark to compare CNN and Transformer SSL approaches on diverse downstream tasks. It demonstrates that reconstruction-based pre-training (notably MAE) yields strong segmentation performance, while contrastive methods excel in classification, with Transformers like Primus-M showing meaningful gains when pre-trained. The study emphasizes the critical roles of fine-tuning schedules, data quality/diversity, and privacy-aware preprocessing, and provides open-source code and pretrained checkpoints to enable rapid reproduction and further method development. Overall, OpenMind serves as a foundation for data-centric and architecture-agnostic progress in 3D SSL for medical imaging and highlights directions for future improvements, including PEFT approaches and better cross-task generalization.

Abstract

The field of self-supervised learning (SSL) for 3D medical images lacks consistency and standardization. While many methods have been developed, it is impossible to identify the current state of the art because of i) varying and small pre-training datasets, ii) varying architectures, and iii) evaluation on differing downstream datasets. In this paper, we bring clarity to this field and lay the foundation for further method advancements through three key contributions: We a) publish the largest publicly available pre-training dataset, comprising 114k 3D brain MRI volumes, enabling all practitioners to pre-train on a large-scale dataset. We b) benchmark existing 3D self-supervised learning methods on this dataset for a state-of-the-art CNN and a state-of-the-art Transformer architecture, clarifying the state of 3D SSL pre-training. Among many findings, we show that pre-trained methods can exceed a strong from-scratch nnU-Net ResEnc-L baseline. Lastly, we c) publish the code of our pre-training and fine-tuning frameworks and provide the pre-trained models created during the benchmarking process to facilitate rapid adoption and reproduction.

Paper Structure

This paper contains 59 sections, 7 figures, 11 tables.

Figures (7)

  • Figure 1: The OpenMind dataset contains 114k 3D head-and-neck volumes of 23 different modalities. It represents the largest openly accessible dataset of 3D medical images currently available.
  • Figure 2: Head-and-neck scans are often defaced, have the face blurred, or have been brain-extracted to guarantee patient privacy. This can potentially harm reconstruction-based SSL methods. We provide anonymization and anatomy masks so that this can be taken into account during method development.
  • Figure 3: DWI preprocessing pipeline: To derive 3D images from the 4D DWI images, they are processed through six steps: denoising, ringing removal, co-registration, field correction, brain extraction, and lastly 3D derivative creation. Best viewed on a screen to see the differences between steps.
  • Figure 4: Cumulative number of image volumes over number of datasets. Of the 800 datasets, 12 account for 50% of all images in the OpenMind dataset, 81 for 75%, and 283 for 90%; the remaining 517 datasets contribute the final 10%.
  • Figure 5: Metadata of the OpenMind dataset: A total of 113,921 images from 800 datasets were curated and standardized, incorporating key metadata categories such as patient age, weight, BMI, sex, race, and health status, along with imaging modality. For clarity, each pie chart and histogram includes only scans for which the respective metadata was available; the number of available cases for each category is displayed above each graphic. Note that these numbers count images, not subjects, of which there are only 34k. They therefore reflect the availability of image-metadata pairs rather than a per-patient count, which would not reveal how many scans have metadata.
  • ...and 2 more figures
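
The cumulative view in the Figure 4 caption (how many datasets are needed to cover a given share of all images) can be illustrated with a minimal sketch. It assumes datasets are ranked from largest to smallest before accumulating, which the caption does not state explicitly. The function name `datasets_needed` and the per-dataset image counts are hypothetical and chosen only for illustration; they are not the OpenMind numbers.

```python
from itertools import accumulate


def datasets_needed(sizes, fraction):
    """Return how many datasets, taken largest first, cover `fraction` of all images.

    Assumption: datasets are sorted by image count in descending order,
    as in a typical cumulative-contribution plot.
    """
    ordered = sorted(sizes, reverse=True)
    total = sum(ordered)
    # Walk the running total and stop at the first dataset that reaches the target.
    for count, running in enumerate(accumulate(ordered), start=1):
        if running >= fraction * total:
            return count
    return len(ordered)


# Hypothetical image counts per dataset (NOT the real OpenMind distribution).
example_sizes = [400, 250, 150, 80, 50, 30, 20, 10, 6, 4]

for share in (0.5, 0.75, 0.9):
    n = datasets_needed(example_sizes, share)
    print(f"{share:.0%} of images come from the {n} largest datasets")
```

Applied to the real per-dataset counts, the same computation would be expected to reproduce the thresholds quoted in the caption (12, 81, and 283 datasets), provided the ranking assumption holds.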