Optimizing fMRI Data Acquisition for Decoding Natural Speech with Limited Participants

Louis Jalouzot, Alexis Thual, Yair Lakretz, Christophe Pallier, Bertrand Thirion

TL;DR

The results suggest that leveraging multi-subject data for natural speech decoding likely requires deeper phenotyping or a substantially larger cohort, and that stories containing sentences with complex syntax or rich semantic content are more challenging to decode.

Abstract

We investigate optimal strategies for decoding perceived natural speech from fMRI data acquired from a limited number of participants. Leveraging the dataset of Lebel et al. (2023), which comprises 8 participants, we first demonstrate the effectiveness of training deep neural networks to predict LLM-derived text representations from fMRI activity. Then, in this data regime, we observe that multi-subject training does not improve decoding accuracy compared to a single-subject approach. Furthermore, training on similar or different stimuli across subjects has a negligible effect on decoding accuracy. Finally, we find that our decoders model syntactic features better than semantic ones, and that stories containing sentences with complex syntax or rich semantic content are more challenging to decode. While our results demonstrate the benefits of having extensive data per participant (deep phenotyping), they suggest that leveraging multi-subject data for natural speech decoding likely requires deeper phenotyping or a substantially larger cohort.

Paper Structure

This paper contains 22 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Method for decoding natural speech from fMRI activity. A. Decoding setup: deep neural networks are trained with a contrastive objective to predict text representations (derived from large language model embeddings) from fMRI activity recorded as participants listened to natural speech. Key parameters include context length $c$, the number of prior chunks added to the text representations; lag $\tau$, the delay between neural activity and the hemodynamic response; and smooth $\kappa$, the number of preceding brain volumes averaged. B. Subject approaches: we compare single-subject (one decoder per subject) and multi-subject (shared decoder backbone with subject-specific layers at the bottom) approaches. C. Retrieval setup: decoders are evaluated in a retrieval setup, in which we rank chunks from a retrieval set (candidates) by the cosine similarity between their representation and the predicted one. We then compute top-10 accuracy (the frequency with which the ground truth appears among the top 10 candidates).
  • Figure 2: Impact of the amount of training data on single-subject performance. Cross-validated top-10 accuracy of single-subject decoders trained on varying amounts of data. The retrieval set contains about 2k samples, which were acquired in different MRI sessions from the training data and come from different stories than those of the training set. We display chance-level performance (0.05%, grey) and the BrainLLM baseline (1.6%, black) for comparison. Note the x-axis break used to display the chance-level performance of untrained decoders.
  • Figure 3: Setup comparison. Impact of various elements of the decoding setup on decoding performance. We start from a very crude version of our setup, named "Base", which is essentially a simple MLP trained with an MSE loss on BERT latents. Each subsequent row corresponds to the previous setup with the modification described by its blue label. We display the top-10 accuracy obtained when training on subjects 1, 2 and 3 with subject-specific layers (SSLs) and the full data. "Random" corresponds to a decoder that produces random representations, resulting in randomly ordered candidates in the retrieval set and thus chance-level performance. We also display the BrainLLM baseline performance.
  • Figure 4: Impact of the number of subjects used in the training set. Multi-subject decoders were trained with subject-specific layers for each of the 255 possible combinations of the 8 subjects. Then, for each subject (color) and each number of subjects (x-axis), we display the best accuracy (y-axis) obtained with any of the combinations including this subject. We test small decoders (left panel, hidden dimension 64) and large ones (right panel, hidden dimension 4096). Here we do not use the extra data available for the first 3 subjects.
  • Figure 5: Impact of training stimuli overlap. We train multi-subject decoders with subject-specific layers on subjects 1, 2 and 3 while varying the ratio of overlapping stimuli between the subjects. The plots display the increment in accuracy over single-subject decoders ($y$ axis $=$ Multi $-$ Single top-10 accuracy) for small decoders (left panel, hidden dimension 64) and large ones (right panel, hidden dimension 4096). For each of the 3 subjects we train on 1/3 of the available data ($\sim$ 4.5 hours) so that every overlap ratio, including 0, is feasible. The decoders are tested on the same stimuli regardless of the overlap.
  • ...and 1 more figure
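
The retrieval evaluation described in the Figure 1 caption (rank the candidates by cosine similarity to the predicted representation, then check whether the ground truth lands in the top 10) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `top_k_accuracy`, the plain-list vector type, and the tie-breaking rule are all assumptions made for illustration, and the sketch assumes that `predicted[i]` is the decoder output for the sample whose ground-truth representation is `candidates[i]`.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_accuracy(
    predicted: Sequence[Vector],
    candidates: Sequence[Vector],
    k: int = 10,
) -> float:
    """Fraction of samples whose ground truth ranks among the k most similar candidates.

    Assumption (hypothetical layout): predicted[i] should retrieve candidates[i].
    """
    if len(predicted) != len(candidates) or not predicted:
        raise ValueError("predicted and candidates must be non-empty and equally long")
    hits = 0
    for i, pred in enumerate(predicted):
        sims = [cosine_similarity(pred, cand) for cand in candidates]
        # Rank of the ground truth = number of candidates strictly more similar.
        # Ties are therefore resolved in favor of the ground truth (an assumption).
        rank = sum(1 for s in sims if s > sims[i])
        if rank < k:
            hits += 1
    return hits / len(predicted)
```

With a retrieval set of $N$ candidates, a decoder that outputs random representations would be expected to score about $\min(k, N)/N$ under this metric, which is the sense in which the "Random" row of Figure 3 sits at chance level.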