An Analysis of Performance Bottlenecks in MRI Pre-Processing

Mathieu Dugré, Yohan Chatelain, Tristan Glatard

TL;DR

This work tackles the computational bottlenecks in MRI pre-processing pipelines used in neuroimaging by profiling popular toolboxes (ANTs, FSL, FreeSurfer) with the Intel VTune profiler across fMRIPrep sub-pipelines. The authors reveal a long-tail CPU-time distribution, with a small subset of functions consuming the majority of runtime, and identify linear interpolation as the dominant bottleneck alongside memory-access delays, quantified across a diverse healthy cohort. A notable finding is that single-precision ANTs can incur higher makespans due to an ITK-related double-precision output requirement, and FreeSurfer recon-all shows poor parallel scaling due to thread synchronization in OpenMP. The results provide a practical reference for optimization, highlight the need for careful consideration of reduced-precision techniques and OpenMP scheduling, and underscore profiling challenges on long-running HPC workflows. Overall, the study offers concrete targets for performance improvements and a methodological framework for evaluating MRI pre-processing pipelines.

Abstract

Magnetic Resonance Imaging (MRI) pre-processing is a critical step for neuroimaging analysis. However, the computational cost of MRI pre-processing pipelines is a major bottleneck for large cohort studies and some clinical applications. While High-Performance Computing (HPC) and, more recently, Deep Learning have been adopted to accelerate the computations, these techniques require costly hardware and are not accessible to all researchers. Therefore, it is important to understand the performance bottlenecks of MRI pre-processing pipelines to improve their performance. Using the Intel VTune profiler, we characterized the bottlenecks of several commonly used MRI pre-processing pipelines from the ANTs, FSL, and FreeSurfer toolboxes. We found that a few functions contributed to most of the CPU time, and that linear interpolation was the largest contributor. Data access was also a substantial bottleneck. We identified a bug in the ITK library that impacts the performance of the ANTs pipeline in single precision and a potential issue with OpenMP scaling in FreeSurfer recon-all. Our results provide a reference for future efforts to optimize MRI pre-processing pipelines.

Paper Structure

This paper contains 17 sections, 2 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Distribution of the functions' CPU time. The left y-axis shows the average total CPU time spent in a function, while the right y-axis shows the cumulative CPU time percentage. The x-axis is the percentage of functions ordered by decreasing CPU time. The data includes all functions from all pipelines.
  • Figure 2: Comparison of makespan between double (blue) and single (orange) precision for ANTs brainExtraction and ANTs registrationSyN.
  • Figure 3: Average time per iteration for ANTs registrationSyN in double and single precision. Only the SyN Registration stage is shown, as the two earlier stages took near-zero time. The error bars show the standard deviation across n=20 subjects.
  • Figure 4: FreeSurfer recon-all analysis. The y-axis shows the average CPU time spent in each function, with error bars showing the standard deviation across n=20 subjects. The x-axis shows the functions ordered by decreasing CPU time, grouped by module. We omitted function names for clarity. Function IDs are specific to each plot. Supplementary materials show the mapping of function IDs to function names for each plot.
  • Figure 5: Makespan and parallel efficiency of FreeSurfer recon-all while varying the number of threads from 1 to 32. The left y-axis shows the makespan in seconds, while the right y-axis shows the parallel efficiency in percent. The log-scaled x-axis shows the number of threads.
  • ...and 2 more figures
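
The two derived quantities plotted in Figures 1 and 5 (cumulative CPU time percentage and parallel efficiency) can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes the conventional definition of parallel efficiency, T(1) / (p · T(p)), which this excerpt does not confirm, and all function names and input values are hypothetical.

```python
def cumulative_cpu_share(cpu_times):
    """Cumulative CPU time percentage, with functions ordered by decreasing CPU time.

    cpu_times: per-function CPU times in seconds (hypothetical input).
    """
    ordered = sorted(cpu_times, reverse=True)
    total = sum(ordered)
    shares = []
    running = 0.0
    for t in ordered:
        running += t
        shares.append(100.0 * running / total)
    return shares


def parallel_efficiency(makespans):
    """Parallel efficiency in percent for each thread count.

    makespans: dict mapping thread count -> makespan in seconds; it must
    contain an entry for 1 thread. Uses the assumed definition
    T(1) / (p * T(p)), which may differ from the paper's.
    """
    t1 = makespans[1]
    return {p: 100.0 * t1 / (p * t) for p, t in sorted(makespans.items())}


# Made-up values for illustration only; they are not results from the paper.
example_shares = cumulative_cpu_share([1.0, 6.0, 3.0])
example_efficiency = parallel_efficiency({1: 1000.0, 2: 600.0, 4: 400.0})
```

A long-tail distribution like the one described for Figure 1 would show up as the first few entries of the cumulative share already approaching 100%, while poor scaling like that reported for FreeSurfer recon-all would show up as efficiency dropping quickly as the thread count grows.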