Scaling up ridge regression for brain encoding in a massive individual fMRI dataset

Sana Ahmadi, Pierre Bellec, Tristan Glatard

TL;DR

Batch parallelization using Dask emerges as a scalable approach for brain encoding with ridge regression on high-performance computing systems using scikit-learn and large fMRI datasets.

Abstract

Brain encoding with neuroimaging data is an established analysis aimed at predicting human brain activity directly from complex stimulus features such as movie frames. Typically, these features are the latent space representation from an artificial neural network, and the stimuli are image, audio, or text inputs. Ridge regression is a popular prediction model for brain encoding due to its good out-of-sample generalization performance. However, training a ridge regression model can be highly time-consuming when dealing with large-scale deep functional magnetic resonance imaging (fMRI) datasets that include many space-time samples of brain activity. This paper evaluates different parallelization techniques to reduce the training time of brain encoding with ridge regression on the CNeuroMod Friends dataset, one of the largest deep fMRI resources currently available. With multi-threading, our results show that the Intel Math Kernel Library (MKL) significantly outperforms the OpenBLAS library, being 1.9 times faster using 32 threads on a single machine. We then evaluated the Dask multi-CPU implementation of ridge regression readily available in scikit-learn (MultiOutput), and we proposed a new "batch" version of Dask parallelization, motivated by a time complexity analysis. In line with our theoretical analysis, MultiOutput parallelization was found to be impractical, i.e., slower than multi-threading on a single machine. In contrast, the Batch-MultiOutput regression scaled well across compute nodes and threads, providing speed-ups of up to 33 times with 8 compute nodes and 32 threads compared to a single-threaded scikit-learn execution. Batch parallelization using Dask thus emerges as a scalable approach for brain encoding with ridge regression on high-performance computing systems using scikit-learn and large fMRI datasets.

Paper Structure (34 sections, 14 equations, 10 figures, 4 tables, 1 algorithm)

This paper contains 34 sections, 14 equations, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: The two main steps of brain encoding: extracting features from movie frames using a pretrained VGG16 model, and predicting the brain response using ridge regression.
  • Figure 2: Multithreading and distributed parallelism in scikit-learn's ridge regression
  • Figure 3: Matrix computations in multi-threaded ridgeCV, MOR, and B-MOR model fitting. Assuming $X \in \mathbb{R}^{n \times p}$, $Y \in \mathbb{R}^{n \times t}$, and $X = USV^T$, the weight matrix $B \in \mathbb{R}^{p \times t}$ equals $B = V (S^2 + \lambda I_{p})^{-1} S U^{T} Y$.
  • Figure 4: Brain encoding results, with performance measured as the Pearson correlation coefficient (r) between real and predicted time series in the Friends dataset (N=6 subjects).
  • Figure 5: Brain encoding predictions for a single individual (sub-01) in two cases. Panel a: corresponding pairs of {fMRI time series and stimuli} were presented to the ridge regression models. Panel b: random permutations of fMRI time series and stimuli data were presented to the ridge regression model.
  • ...and 5 more figures
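
The weight formula in the Figure 3 caption, $B = V (S^2 + \lambda I_{p})^{-1} S U^{T} Y$, can be checked numerically. The NumPy sketch below is an illustration only, not the authors' implementation: it assumes a single fixed $\lambda$, no intercept, and $n \ge p$, and the function names `ridge_svd` and `ridge_svd_batched` are hypothetical. The batched variant splits the target columns of $Y$ into groups and fits each group independently, which is the property a batch parallelization over brain targets relies on; here the batches run sequentially, whereas the paper distributes them with Dask.

```python
import numpy as np


def ridge_svd(X, Y, lam):
    """Ridge weights B = V (S^2 + lam I)^{-1} S U^T Y from the thin SVD of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # (S^2 + lam I)^{-1} S is diagonal, so it reduces to a per-component scale.
    d = s / (s**2 + lam)
    return Vt.T @ (d[:, None] * (U.T @ Y))


def ridge_svd_batched(X, Y, lam, n_batches):
    """Fit groups of target columns independently, then concatenate the weights."""
    column_batches = np.array_split(np.arange(Y.shape[1]), n_batches)
    # Each batch is an independent problem; a distributed scheduler could
    # assign one batch per worker. Here they are solved one after another.
    return np.hstack([ridge_svd(X, Y[:, cols], lam) for cols in column_batches])


# Small synthetic problem (sizes are arbitrary, chosen only for illustration).
rng = np.random.default_rng(0)
n, p, t = 50, 8, 12
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, t))
lam = 1.5

B = ridge_svd(X, Y, lam)
# Reference solution from the normal equations (X^T X + lam I) B = X^T Y.
B_normal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
B_batched = ridge_svd_batched(X, Y, lam, n_batches=4)

print(np.allclose(B, B_normal), np.allclose(B, B_batched))
```

Because the ridge solution for each target column depends only on $X$ and that column of $Y$, the batched weights match the full fit. This sketch recomputes the SVD of $X$ once per batch, so fewer, larger batches repeat that work less often.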