Hierarchical storage management in user space for neuroimaging applications
Valérie Hayot-Sasson, Tristan Glatard
TL;DR
Neuroimaging data processing on HPC clusters faces data-transfer bottlenecks that standard tools do not address. Sea, a user-space data-management library, uses LD_PRELOAD to intercept glibc I/O calls, redirects them to compute-local caches, and manages asynchronous flushing and eviction, so that existing pipelines (FSL, SPM, AFNI) can run without being rewritten and with reduced Lustre traffic. It achieves substantial speedups under Lustre contention (up to $32\times$ in controlled tests and up to $11\times$ in production), adds minimal overhead when the file system is not under contention, and reduces the number of files written to shared storage. Sea thus provides a practical, non-invasive approach to data-local caching that preserves workflow compatibility while mitigating data-transfer costs in data-intensive neuroimaging analyses.
Abstract
Neuroimaging open-data initiatives have led to increased availability of large scientific datasets. While these datasets are shifting the processing bottleneck from compute-intensive to data-intensive, current standardized analysis tools have yet to adopt strategies that mitigate the costs associated with large data transfers. A major challenge in adapting neuroimaging applications for data-intensive processing is that they must be entirely rewritten. To facilitate data management for standardized neuroimaging tools, we developed Sea, a library that intercepts and redirects application read and write calls to minimize data transfer time. In this paper, we investigate the performance of Sea on three preprocessing pipelines implemented using standard toolboxes (FSL, SPM and AFNI), using three neuroimaging datasets of different sizes (OpenNeuro's ds001545, PREVENT-AD and the HCP dataset) on two high-performance computing clusters. Our results demonstrate that Sea provides large speedups (up to $32\times$) when the performance of the shared file system (e.g. Lustre) has deteriorated. When the shared file system is not overburdened by other users, performance is unaffected by Sea, suggesting that Sea's overhead is minimal even in cases where its benefits are limited. Overall, Sea is beneficial even when the performance gain is minimal, as it can be used to limit the number of files created on parallel file systems.
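
To make the idea of intercepting and redirecting read and write calls more concrete, the following is a minimal Python sketch of the path-translation step alone. It is not Sea's implementation, which works by intercepting glibc calls through LD_PRELOAD; the function names `redirect_path` and `flush`, the notion of an ordered list of storage tiers, and the "first tier that holds the file, else the fastest tier" policy are all assumptions made for illustration.

```python
import os
import shutil


def redirect_path(path, mount, tiers):
    """Translate a path under `mount` to a location in one of `tiers`.

    Hypothetical policy (an assumption, not Sea's documented behaviour):
    return the copy in the first tier that already holds the file,
    otherwise place the file in the first (fastest) tier.
    Paths outside `mount` are returned unchanged.
    """
    path = os.path.abspath(path)
    mount = os.path.abspath(mount)
    if os.path.commonpath([path, mount]) != mount:
        return path  # not managed: leave the call untouched
    rel = os.path.relpath(path, mount)
    for tier in tiers:
        candidate = os.path.normpath(os.path.join(tier, rel))
        if os.path.exists(candidate):
            return candidate  # existing copy found in this tier
    return os.path.normpath(os.path.join(tiers[0], rel))  # new file: fastest tier


def flush(rel, fast_tier, shared_tier):
    """Copy a file from the fast tier to shared storage (synchronous here;
    the text describes flushing as asynchronous)."""
    src = os.path.join(fast_tier, rel)
    dst = os.path.join(shared_tier, rel)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy2(src, dst)
    return dst
```

In this sketch, a pipeline that writes to a path under the mount point would have its output placed on compute-local storage, and only the files passed to `flush` would reach the shared file system, which is one way the number of files created there could be limited.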
