R&B -- Rhythm and Brain: Cross-subject Decoding of Music from Human Brain Activity

Matteo Ferrante, Matteo Ciferri, Nicola Toschi

TL;DR

The approach integrates functional and anatomical alignment techniques to enable cross-subject decoding, addressing the challenges posed by the low temporal resolution and low signal-to-noise ratio (SNR) of fMRI data.

Abstract

Music is a universal phenomenon that profoundly influences human experiences across cultures. This study investigates whether music can be decoded from human brain activity measured with functional MRI (fMRI) during its perception. Leveraging recent large datasets and pre-trained computational models, we construct mappings between neural data and latent representations of musical stimuli. Our approach integrates functional and anatomical alignment techniques to enable cross-subject decoding, addressing the challenges posed by the low temporal resolution and low signal-to-noise ratio (SNR) of fMRI data. Starting from the GTZan fMRI dataset, in which five participants listened to 540 musical stimuli from 10 different genres while their brain activity was recorded, we used the CLAP (Contrastive Language-Audio Pretraining) model to extract latent representations of the musical stimuli and developed voxel-wise encoding models to identify brain regions responsive to these stimuli. By applying a threshold to the association between predicted and actual brain activity, we identified specific regions of interest (ROIs) that can be interpreted as key players in music processing. Our decoding pipeline, primarily retrieval-based, employs a linear map to project brain activity onto the corresponding CLAP features. This enables us to predict and retrieve the musical stimuli most similar to those that evoked the fMRI data. Our results demonstrate state-of-the-art identification accuracy, with our methods significantly outperforming existing approaches. Our findings suggest that neural-based music retrieval systems could enable personalized recommendations and therapeutic applications. Future work could use neuroimaging with higher temporal resolution and generative models to improve decoding accuracy and explore the neural underpinnings of music perception and emotion.
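
The retrieval step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the weight matrix `W`, the toy three-dimensional embeddings, the stimulus names, and the function names are all hypothetical, the linear map is taken as already fitted, and cosine similarity is assumed as the similarity measure because the abstract does not name the metric.

```python
import math

def apply_linear_map(W, x):
    """Project a brain-activity vector x into the embedding space: y = W x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    norm_a = math.sqrt(sum(v * v for v in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(p * q for p, q in zip(a, b)) / (norm_a * norm_b)

def retrieve_top_k(pred, candidates, k=5):
    """Return the names of the k candidate embeddings most similar to pred."""
    ranked = sorted(candidates.items(),
                    key=lambda item: cosine(pred, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical toy setup: 2 voxels mapped to a 3-dimensional embedding space.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [0.5, 0.5]]

# Hypothetical stand-ins for CLAP embeddings of candidate stimuli.
candidates = {
    "jazz_01": [1.0, 0.0, 0.5],
    "jazz_02": [0.9, 0.1, 0.5],
    "metal_01": [0.0, 1.0, 0.5],
}

brain = [1.0, 0.0]                   # toy brain-activity vector
pred = apply_linear_map(W, brain)    # predicted embedding
top = retrieve_top_k(pred, candidates, k=2)
print(top)
```

In a realistic setting the embeddings would have hundreds of dimensions and the map would be estimated from training data (the Figure 5 caption mentions Ridge regression); the ranking logic stays the same.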

Paper Structure (24 sections, 2 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Overview of our pipeline. Top pane: in the GTZan fMRI experiment, five participants listened to multiple musical tracks while their brain activity was recorded with functional MRI. This setup captures the direct neural response to complex auditory inputs. Middle pane: our encoding pipeline. Starting from the music stimulus, we first obtain its latent representation using the CLAP model. We then fit voxel-wise encoding models that predict each voxel's response from this latent representation. A threshold is applied to the voxel-wise correlation between actual and predicted brain activity to identify the brain regions whose activity allows the best decoding of musical stimuli. These regions are treated as the music-responsive regions of interest (ROIs). Bottom pane: our decoding pipeline, which is primarily retrieval-based. We train a model that takes brain activity from the previously identified ROIs as input and predicts the corresponding CLAP features. Using these features, we then search the CLAP latent space for the closest musical stimuli, selecting the k nearest (k=5) as our retrieved samples.
  • Figure 2: Two-dimensional t-SNE representation of CLAP latent representations of music, coloured by different musical genres.
  • Figure 3: Regions of interest (ROIs) corresponding to musically responsive areas were identified by applying a threshold to the correlations between predicted and actual brain activity. This process was part of a cross-validation procedure used in the encoding models.
  • Figure 4: Confusion matrix showing our model's accuracy (number of correct predictions over the number of total predictions) in classifying musical genres based on fMRI data from five participants. Diagonal elements represent correct predictions for each genre, while off-diagonal elements indicate misclassifications. Each genre has 30 music stimuli, evenly distributed across the subjects; a value of 30 in the main diagonal therefore represents 100% accuracy. The model performs well for classical, jazz, and pop genres, with minimal confusion, while disco and metal genres show higher misclassification rates, likely due to overlapping music features. The matrix highlights the effectiveness of the cross-subject decoding pipeline and areas for improvement.
  • Figure 5: Time-frequency decompositions (TFDs, used as illustrative visual aids to estimate similarity between audio data) of original musical stimuli (jazz and metal) and of the stimuli retrieved from the top-3 CLAP embeddings predicted by the Ridge regression decoding model. The left side displays the TFD of the original jazz stimulus, while the right side shows the TFD of the original metal stimulus. Below each original stimulus, the top-3 predicted stimuli are shown. For the jazz stimulus, the predicted stimuli were all identified as jazz. For the metal stimulus, the top-3 predictions included two metal embeddings and one rock embedding. This comparison highlights the model's ability to predict musical genres from brain activity, while also illustrating occasional genre misclassification, particularly in more complex or overlapping genre spaces.
  • ...and 1 more figures
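
The ROI selection described in the Figure 1 and Figure 3 captions (thresholding the voxel-wise correlation between predicted and actual brain activity) can be sketched as below. This is a hedged illustration rather than the authors' code: the toy time series, the function names, and the threshold value of 0.1 are assumptions, and Pearson correlation is assumed as the association measure.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def select_voxels(actual, predicted, threshold):
    """Score every voxel and keep the indices whose correlation exceeds threshold.

    actual, predicted: lists of per-voxel series (voxels x samples).
    """
    scores = [pearson(a, p) for a, p in zip(actual, predicted)]
    selected = [i for i, r in enumerate(scores) if r > threshold]
    return selected, scores

# Hypothetical toy data: 3 voxels, 4 held-out samples each.
actual = [
    [1.0, 2.0, 3.0, 4.0],   # voxel 0: well predicted
    [1.0, 2.0, 3.0, 4.0],   # voxel 1: prediction is anti-correlated
    [2.0, 2.0, 2.0, 2.0],   # voxel 2: constant signal, no usable correlation
]
predicted = [
    [1.1, 1.9, 3.2, 3.9],
    [4.0, 3.0, 2.0, 1.0],
    [1.0, 2.0, 3.0, 4.0],
]

roi, scores = select_voxels(actual, predicted, threshold=0.1)  # 0.1 is an assumed value
print(roi)
```

Only voxel 0 survives here. In practice the predictions would come from encoding models evaluated on held-out folds, as the Figure 3 caption indicates that the thresholding was part of a cross-validation procedure.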