Contrastive Learning to Fine-Tune Feature Extraction Models for the Visual Cortex

Alex Mulrooney, Austin J. Brockmeier

TL;DR

This work adapts contrastive learning (CL) to fine-tune a convolutional neural network, which was pretrained for image classification, such that a mapping of a given image's features is more similar to the corresponding fMRI response than to the responses to other images.

Abstract

Predicting the neural response to natural images in the visual cortex requires extracting relevant features from the images and relating those features to the observed responses. In this work, we optimize the feature extraction in order to maximize the information shared between the image features and the neural response across voxels in a given region of interest (ROI) extracted from the BOLD signal measured by fMRI. We adapt contrastive learning (CL) to fine-tune a convolutional neural network, which was pretrained for image classification, such that a mapping of a given image's features is more similar to the corresponding fMRI response than to the responses to other images. We exploit the recently released Natural Scenes Dataset (Allen et al., 2022) as organized for the Algonauts Project (Gifford et al., 2023), which contains the high-resolution fMRI responses of eight subjects to tens of thousands of naturalistic images. We show that CL fine-tuning creates feature extraction models that enable higher encoding accuracy in early visual ROIs as compared to both the pretrained network and a baseline approach that uses a regression loss at the output of the network to tune it for fMRI response encoding. We investigate inter-subject transfer of the CL fine-tuned models, including subjects from another, lower-resolution dataset (Gong et al., 2023). We also pool subjects for fine-tuning to further improve the encoding performance. Finally, we examine the performance of the fine-tuned models on common image classification tasks, explore the landscape of ROI-specific models by applying dimensionality reduction to the Bhattacharyya dissimilarity matrix created using the predictions on those tasks (Mao et al., 2024), and investigate lateralization of the processing for early visual ROIs using salience maps of the classifiers built on the CL-tuned models.
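
To make the contrastive objective concrete, the following is a minimal NumPy sketch of an InfoNCE-style loss over a batch of paired embeddings. It is an illustration under stated assumptions, not the paper's implementation: the exact form of the loss, the `temperature` value, and the function names `cosine_similarity_matrix` and `contrastive_loss` are all hypothetical. The only property taken from the abstract is that each image embedding should be more similar to its own fMRI response embedding than to the responses to other images.

```python
import numpy as np


def cosine_similarity_matrix(zx, zy):
    """Pairwise cosine similarity; entry (i, j) compares zx[i] with zy[j]."""
    zx = zx / np.linalg.norm(zx, axis=1, keepdims=True)
    zy = zy / np.linalg.norm(zy, axis=1, keepdims=True)
    return zx @ zy.T


def contrastive_loss(zx, zy, temperature=0.1):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    zx: (n, d) image embeddings, zy: (n, d) fMRI response embeddings.
    Row i of zx and row i of zy form the matched pair; every other image
    in the batch acts as a negative for response i.
    """
    # logits[j, i] = similarity of response j to image i, scaled by temperature.
    logits = cosine_similarity_matrix(zx, zy).T / temperature
    # Subtract the row maximum for a numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matched pair sits on the diagonal; average its negative log-probability.
    return -np.mean(np.diag(log_prob))


# Synthetic check: aligned pairs should give a lower loss than mismatched pairs.
rng = np.random.default_rng(0)
zy_demo = rng.normal(size=(8, 16))
zx_demo = zy_demo + 0.01 * rng.normal(size=(8, 16))
loss_aligned = contrastive_loss(zx_demo, zy_demo)
loss_mismatched = contrastive_loss(zx_demo[::-1], zy_demo)
```

In a fine-tuning setting, `zx` would come from the network and projection layers applied to a batch of images and `zy` from the projected fMRI responses, with the loss minimized by gradient descent in an automatic differentiation framework rather than evaluated in plain NumPy.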

Paper Structure (39 sections, 8 equations, 18 figures, 5 tables)

Figures (18)

  • Figure 1: Overview of the proposed approach. (Green arrows) The brain responses to natural images (GLM beta coefficients fit to the fMRI) are organized into visual ROIs. (Blue arrows) Contrastive learning (CL) is used to fine-tune the AlexNet CNN $f_A^r$ based on a loss $\mathcal{L}$ that maximizes the cosine similarity $\mathrm{sim}$ of processed image-response pairs $\mathbf{z}^{x,r},\mathbf{z}^{y,r}$, contrasted to random pairs $\mathbf{z}^{x',r},\mathbf{z}^{y,r}$ with $x'\neq x$ (not shown). $\mathbf{z}^{x,r}$ is obtained by passing the image $x$ through the AlexNet CNN $f_A^r$, then a linear projection $\mathbf{W}^{x,r}$, and then the shared non-linear projection head $g^r$. $\mathbf{z}^{y,r}$ is obtained by passing the fMRI response $\mathbf{y}^r$ through a linear projection $\mathbf{W}^{y,r}$ and then the shared non-linear projection head $g^r$. $f_A^r$ is updated by back-propagation using the gradient $\nabla_{f_A^r} \mathcal{L}$. (Purple arrows) An encoding model for the ROI uses the output from the pre-selected layer $l_r$ of the fine-tuned AlexNet, applies PCA, and then fits an $\ell_2$-penalized linear model via ridge regression. Performance of the encoding predictions is assessed on held-out images via the correlation coefficient averaged across an ROI's voxels, $\bar{\rho}$.
  • Figure 2: Matrix of encoding scores for each ROI and AlexNet layer, where the entries are the average encoding accuracy over 5-fold cross validation when a particular AlexNet layer's activations (after PCA for dimensionality reduction) are used to predict a particular ROI (using the best penalty term from cross validation for the regularization parameter for ridge regression).
  • Figure 3: Voxel-wise differences in encoding performance $\rho$ visualized on the FreeSurfer template for each subject. 'Hot' colors indicate improvements with contrastive learning (CL) based fine-tuning versus the pretrained AlexNet.
  • Figure 4: Cross-subject results for the early and higher visual ROI groups. The element in block $i,j$ is the percentage of voxels with improved encoding when using subject $i$'s CL-tuned feature extraction model, followed by PCA and an $\ell_2$-penalized linear model, to predict the fMRI responses for subject $j$ versus using features from the pretrained AlexNet CNN.
  • Figure 5: Comparison of percentage of voxels with improved encoding when using CL fine-tuning versus the untuned AlexNet baseline for cross-subject models and pooled models (with averaged embedding layer dimension) across all subjects and ROIs. The average cross-subject improvement for a given ROI on the $i$th subject is calculated as the mean percentage of improved voxels versus the untuned baseline across all other subjects $j \neq i$ when using the $j$th subject's fine-tuned model. The pooled improvement is the percentage of voxels with higher correlation versus the untuned baseline for a given subject and ROI when using the pooled model for that ROI.
  • ...and 13 more figures
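
The encoding stage described in the Figure 1 caption (PCA on layer activations, an $\ell_2$-penalized linear model fit by ridge regression, and the voxel-averaged correlation $\bar{\rho}$ on held-out images) can be sketched in NumPy as follows. This is a hedged illustration on synthetic data, not the authors' code: the helper names `fit_pca`, `fit_ridge`, and `mean_voxel_correlation`, the array sizes, and the fixed penalty `alpha` are assumptions. The Figure 2 caption indicates the penalty is chosen by cross validation, which this sketch omits for brevity.

```python
import numpy as np


def fit_pca(features, n_components):
    """Return the feature mean and the top principal directions, shape (d, k)."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components].T


def fit_ridge(X, Y, alpha):
    """Closed-form ridge regression; the intercept is handled by centering."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b


def mean_voxel_correlation(Y_true, Y_pred):
    """Pearson correlation per voxel (column), averaged across voxels."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    rho = (yt * yp).sum(axis=0) / (
        np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    )
    return rho.mean()


# Synthetic stand-ins: "layer activations" driven by 10 latent factors,
# and "voxel responses" that depend linearly on the same factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 10))
features = latent @ rng.normal(size=(10, 50)) + 0.1 * rng.normal(size=(200, 50))
responses = latent @ rng.normal(size=(10, 30)) + 0.1 * rng.normal(size=(200, 30))

train, test = slice(0, 150), slice(150, 200)

# Fit PCA and ridge on training images only, then score held-out images.
mean, components = fit_pca(features[train], n_components=10)
Z_train = (features[train] - mean) @ components
Z_test = (features[test] - mean) @ components
W, b = fit_ridge(Z_train, responses[train], alpha=1.0)
rho_bar = mean_voxel_correlation(responses[test], Z_test @ W + b)
```

Fitting both the PCA and the ridge weights on the training split alone keeps the held-out correlation an honest estimate of encoding accuracy.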