A large-scale heterogeneous 3D magnetic resonance brain imaging dataset for self-supervised learning

Stefano Cerri, Asbjørn Munk, Jakob Ambsdorf, Julia Machnio, Sebastian Nørgaard Llambias, Vardan Nersesjan, Christian Hedeager Krag, Peirong Liu, Pablo Rocamora García, Mostafa Mehdipour Ghazi, Mikael Boesen, Michael Eriksen Benros, Juan Eugenio Iglesias, Mads Nielsen

TL;DR

FOMO300K is a large-scale, heterogeneous dataset of 318,877 brain magnetic resonance imaging (MRI) scans from 82,678 MRI sessions and 59,969 subjects, aggregated from 920 publicly available sources. It is intended to support the development and benchmarking of self-supervised learning methods in medical imaging at scale.

Abstract

We present FOMO300K, a large-scale, heterogeneous dataset of 318,877 brain Magnetic Resonance Imaging (MRI) scans from 82,678 MRI sessions and 59,969 subjects, aggregated from 920 publicly available sources. The dataset includes both clinical- and research-grade images, multiple MRI sequences, and a wide range of anatomical and pathological variability, including scans with large brain anomalies. Minimal preprocessing was applied to preserve the original image characteristics while reducing entry barriers for new users. Companion code for self-supervised pretraining and finetuning is provided, along with pretrained models. FOMO300K is intended to support the development and benchmarking of self-supervised learning methods in medical imaging at scale.

Paper Structure

This paper contains 20 sections, 1 equation, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Representative examples from the FOMO300K dataset, illustrating the heterogeneity in image quality, MRI sequences, and the presence of brain anomalies.
  • Figure 2: (A) Age distribution of participants at each MRI session. (B) Distribution of subject groups at each MRI session. (C) Sex distribution at each MRI session (F = female, M = male). (D) Handedness distribution at each MRI session (R = right-handed, L = left-handed, A = ambidextrous). (E) Distribution of MRI scanner manufacturers across all scans. (F) Acquisition types showing the proportion of 3D and 2D sequences. (G) Field strength distribution. (H) Top 15 scanner models, with the MRI scanner manufacturer indicated by color coding. (I) Slice thickness distribution across the dataset (note: may include thickness of already resampled data from source datasets). (J) Top 15 MRI sequences, with modality indicated by color coding. Percentages are reported relative to the subset of data entries with available information for each variable.
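
The Figure 2 caption states that percentages are computed over only those entries where a given variable is available. The following is a minimal sketch of that convention, not the authors' code: the function name `percentages_among_available`, the use of `None` to mark missing metadata, and the toy values are all assumptions made for illustration.

```python
from collections import Counter


def percentages_among_available(values):
    """Return the percentage of each category, ignoring missing entries.

    Missing metadata is represented here by None (an assumption for this
    sketch); such entries are excluded from the denominator.
    """
    available = [v for v in values if v is not None]
    if not available:
        return {}
    counts = Counter(available)
    total = len(available)
    return {category: 100.0 * n / total for category, n in counts.items()}


# Hypothetical toy metadata, not taken from FOMO300K.
sex = ["F", "M", None, "F", None, "M", "F", None]
print(percentages_among_available(sex))
```

Under this convention, the three entries with no recorded value do not reduce the reported shares: the percentages sum to 100 over the five entries that carry the information, not over all eight.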