Hierarchical storage management in user space for neuroimaging applications

Valérie Hayot-Sasson, Tristan Glatard

TL;DR

Neuroimaging data processing on HPC clusters faces data-transfer bottlenecks that standard tools do not address. Sea is a user-space data-management library that uses LD_PRELOAD to intercept glibc I/O calls, redirect them to compute-local caches, and manage asynchronous flushing and eviction, so that existing pipelines (FSL, SPM, AFNI) run unmodified with reduced Lustre traffic. It achieves substantial speedups under Lustre contention (up to $32\times$ in controlled tests and up to $11\times$ in production), adds minimal overhead when the file system is healthy, and reduces the number of files written to shared storage. Sea is thus a practical, non-invasive approach to data-local caching that preserves workflow compatibility while mitigating data-transfer costs in data-intensive neuroimaging analyses.
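
The redirection idea in the TL;DR can be illustrated with a minimal sketch. It models only the path-translation decision, not the LD_PRELOAD interception itself, and it is not Sea's actual code: the function name `translate`, the mount point `/sea`, the tier paths, and the "first tier that holds the file, otherwise the fastest tier" policy are all assumptions made for illustration.

```python
from pathlib import PurePosixPath


def translate(path, mount, tiers, exists):
    """Map a path under a managed mount point to a concrete storage tier.

    Hypothetical policy (an assumption, not taken from the paper):
    - paths outside `mount` pass through unchanged;
    - an existing file resolves to the first tier (fastest first) holding it;
    - a new file is placed on the fastest tier.
    `exists` is a callable that reports whether a concrete path is present.
    """
    try:
        relative = PurePosixPath(path).relative_to(mount)
    except ValueError:
        return path  # not under the managed mount point: leave untouched

    candidates = [str(PurePosixPath(tier) / relative) for tier in tiers]
    for candidate in candidates:
        if exists(candidate):
            return candidate
    return candidates[0]  # new file: write to the fastest tier


# Illustrative setup: a compute-local cache in front of shared storage.
tiers = ["/dev/shm/cache", "/lustre/project"]
on_disk = {"/lustre/project/sub-01/anat.nii"}

print(translate("/sea/sub-01/anat.nii", "/sea", tiers, on_disk.__contains__))
print(translate("/sea/out.nii", "/sea", tiers, on_disk.__contains__))
```

In this sketch, a read of an input file falls through to shared storage because no cache holds it yet, while a newly written output lands on the local tier, which is where flushing and eviction would later apply.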

Abstract

Neuroimaging open-data initiatives have led to increased availability of large scientific datasets. While these datasets are shifting the processing bottleneck from compute-intensive to data-intensive, current standardized analysis tools have yet to adopt strategies that mitigate the costs associated with large data transfers. A major challenge in adapting neuroimaging applications for data-intensive processing is that they must be entirely rewritten. To facilitate data management for standardized neuroimaging tools, we developed Sea, a library that intercepts and redirects application read and write calls to minimize data transfer time. In this paper, we investigate the performance of Sea on three preprocessing pipelines implemented using standard toolboxes (FSL, SPM and AFNI), using three neuroimaging datasets of different sizes (OpenNeuro's ds001545, PREVENT-AD and the HCP dataset) on two high-performance computing clusters. Our results demonstrate that Sea provides large speedups (up to $32\times$) when the performance of the shared file system (e.g. Lustre) has deteriorated. When the shared file system is not overburdened by other users, performance is unaffected by Sea, suggesting that Sea's overhead is minimal even in cases where its benefits are limited. Overall, Sea is beneficial even when the performance gain is minimal, as it can be used to limit the number of files created on parallel file systems.

Paper Structure (23 sections, 5 figures, 2 tables)


Figures (5)

  • Figure 1: The Sea data-management library overview
  • Figure 2: Makespan comparison between Sea and Baseline on the controlled, dedicated cluster. Makespan denotes the total time between execution launch and completion of the last computing task.
  • Figure 3: Makespan comparison between Sea and tmpfs on the production cluster with flushing disabled
  • Figure 4: Makespan comparison between Sea and Baseline on the production cluster with flushing disabled
  • Figure 5: Makespan comparison between Sea and Baseline on the production cluster with flushing enabled for all files
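
The makespan definition given in the Figure 2 caption can be written out as a short sketch. The helper names and all timing values below are made up for illustration and are not results from the paper; computing speedup as the ratio of Baseline makespan to Sea makespan is also an assumption about how the reported speedups are obtained.

```python
def makespan(launch_time, task_end_times):
    """Total time between execution launch and completion of the last task."""
    return max(task_end_times) - launch_time


def speedup(baseline_makespan, sea_makespan):
    """Assumed definition: how many times shorter the Sea makespan is."""
    return baseline_makespan / sea_makespan


# Illustrative timestamps in seconds since launch (not measured data).
baseline = makespan(0.0, [850.0, 900.0, 720.0])
with_sea = makespan(0.0, [300.0, 280.0, 250.0])

print(baseline, with_sea, speedup(baseline, with_sea))
```

Because makespan is set by the slowest task, a single task stalled on the shared file system is enough to inflate it, even if every other task finishes quickly.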