Benchmarking the Reproducibility of Brain MRI Segmentation Across Scanners and Time

Ekaterina Kondrateva, Sandzhi Barg, Mikhail Vasiliev

TL;DR

This paper examines the reproducibility of brain MRI morphometry across scanners and time by benchmarking two segmentation pipelines integrated into FreeSurfer: FastSurfer and SynthSeg. Using SIMON (longitudinal) and SRPBS (multi-site) data, it quantifies inter-scan variability with Dice, Surface Dice, HD95, and MAPE, finding up to 7–8% volume variation in small subcortical structures and questioning whether 5–10% longitudinal changes can be detected in pea-sized regions. It also analyzes the impact of registration and interpolation choices and proposes surface-based quality filtering to improve reliability, delivering a reproducible benchmark and advocating harmonization strategies for real-world neuroimaging studies. The work shows that domain-induced morphometric noise persists even in state-of-the-art pipelines, and it provides code to encourage transparent benchmarking and robust morphometry in multi-site settings.

Abstract

Accurate and reproducible brain morphometry from structural MRI is critical for monitoring neuroanatomical changes across time and across imaging domains. Although deep learning has accelerated segmentation workflows, scanner-induced variability and reproducibility limitations remain, especially in longitudinal and multi-site settings. In this study, we benchmark two modern segmentation pipelines, FastSurfer and SynthSeg, both integrated into FreeSurfer, one of the most widely adopted tools in neuroimaging. Using two complementary datasets, a 17-year longitudinal cohort (SIMON) and a 9-site test-retest cohort (SRPBS), we quantify inter-scan segmentation variability using the Dice coefficient, Surface Dice, the 95th-percentile Hausdorff Distance (HD95), and Mean Absolute Percentage Error (MAPE). Our results reveal up to 7–8% volume variation in small subcortical structures such as the amygdala and ventral diencephalon, even under controlled test-retest conditions. This raises a key question: is it feasible to detect subtle longitudinal changes on the order of 5–10% in pea-sized brain regions, given the magnitude of domain-induced morphometric noise? We further analyze the effects of registration templates and interpolation modes, and propose surface-based quality filtering to improve segmentation reliability. This study provides a reproducible benchmark of morphometric stability and emphasizes the need for harmonization strategies in real-world neuroimaging studies. Code and figures: https://github.com/kondratevakate/brain-mri-segmentation
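
To make two of the metrics named above concrete, the following is a minimal, dependency-free sketch of a per-structure Dice coefficient and a volume MAPE. It is not the authors' implementation. The function names `dice` and `mape_from_mean` are hypothetical, and the choice to measure MAPE against the across-session mean volume is an assumption made for illustration; the paper may define its reference differently. Label maps are treated as flat sequences of integer labels, so a flattened 3D segmentation could be passed in the same way.

```python
from statistics import mean


def dice(labels_a, labels_b, label):
    """Dice overlap of one structure between two equally sized label maps."""
    a = [v == label for v in labels_a]
    b = [v == label for v in labels_b]
    if len(a) != len(b):
        raise ValueError("label maps must have the same number of voxels")
    overlap = sum(x and y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    # Convention (an assumption): a structure absent from both maps counts as perfect agreement.
    return 2.0 * overlap / total if total else 1.0


def mape_from_mean(volumes):
    """Mean absolute percentage deviation of session volumes from their mean.

    Assumption: the reference volume is the across-session mean.
    """
    center = mean(volumes)
    return 100.0 * mean(abs(v - center) / center for v in volumes)


# Toy data, invented for illustration only (not taken from SIMON or SRPBS).
scan_a = [0, 17, 17, 17, 18, 0]  # flattened label map, session 1
scan_b = [0, 17, 17, 18, 18, 0]  # flattened label map, session 2
print(dice(scan_a, scan_b, label=17))

session_volumes_cm3 = [1.50, 1.60, 1.40, 1.50]  # hypothetical volumes of one structure
print(mape_from_mean(session_volumes_cm3))
```

Read this way, a MAPE of several percent under test-retest conditions is a noise floor: a longitudinal change of similar size in the same structure would be hard to separate from scanner and session effects.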

Paper Structure

This paper contains 30 sections, 4 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Volume stability of the left and right hippocampus and amygdala for Subject 1 across 15 sessions in the SRPBS Traveling Subject dataset. FastSurfer results with ANTs registration. The first 5 days (shaded) were acquired on the same scanner; subsequent sessions were acquired at different sites.
  • Figure 2: SIMON dataset: Volume trajectories of the amygdala and hippocampus across 73 MRI scans acquired over 17 years from one healthy individual, segmented with SynthSeg. Confidence intervals and regression trends are shown.
  • Figure 3: SIMON dataset: Comparison of volume distributions from FastSurfer and SynthSeg for the amygdala and hippocampus. The y-axis denotes volume in cm³.
  • Figure 4: Inter-scanner variability of cortical volumes in the SIMON dataset. Boxplots show Dice and Surface Dice metrics between consecutive scans, grouped by hemisphere.
  • Figure 5: Inter-scanner variability of cortical volumes in the SIMON dataset. Boxplots show the percentage difference from the structure-specific mean across repeated sessions, grouped by hemisphere.
  • ...and 1 more figure