Scaling laws for language encoding models in fMRI

Richard Antonello, Aditya Vaidya, Alexander G. Huth

TL;DR

This work tests whether scaling laws from language modeling extend to brain encoding by evaluating large open-source transformers (OPT, LLaMA) and audio models (HuBERT, WavLM, Whisper) as feature spaces for fMRI prediction. Both semantic and acoustic encoding performance scale roughly logarithmically with model size and training data; for language models scaled from 125M to 30B parameters, encoding performance improves by about 15% across three subjects. A noise ceiling analysis indicates near-maximal predictivity in regions such as the precuneus and higher auditory cortex. A stacked regression approach further improves auditory cortex predictions by combining semantic and acoustic features. The findings imply that increasing scale in both models and data can yield highly effective brain-language encoders, offering pathways to decoding and deeper mechanistic understanding. The work is supported by open data and code releases.

Abstract

Representations from transformer-based unidirectional language models are known to be effective at predicting brain responses to natural language. However, most studies comparing language models to brains have used GPT-2 or similarly sized language models. Here we tested whether larger open-source models such as those from the OPT and LLaMA families are better at predicting brain responses recorded using fMRI. Mirroring scaling results from other contexts, we found that brain prediction performance scales logarithmically with model size from 125M to 30B parameter models, with ~15% increased encoding performance as measured by correlation with a held-out test set across 3 subjects. Similar logarithmic behavior was observed when scaling the size of the fMRI training set. We also characterized scaling for acoustic encoding models that use HuBERT, WavLM, and Whisper, and we found comparable improvements with model size. A noise ceiling analysis of these large, high-performance encoding models showed that performance is nearing the theoretical maximum for brain areas such as the precuneus and higher auditory cortex. These results suggest that increasing scale in both models and data will yield incredibly effective models of language processing in the brain, enabling better scientific understanding as well as applications such as decoding.
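The logarithmic scaling described in the abstract can be illustrated with a simple least-squares fit of the form score = a + b * log10(size). The sketch below is a minimal illustration, not the authors' analysis code: the function name `fit_log_scaling` is hypothetical, and the scores are synthetic values generated from a known log law (they are not results from the paper), so the fit simply recovers the coefficients that produced them.

```python
import math


def fit_log_scaling(sizes, scores):
    """Ordinary least-squares fit of score = a + b * log10(size).

    Returns the intercept a and the slope b (the change in score
    per tenfold increase in size).
    """
    xs = [math.log10(s) for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Parameter counts spanning the 125M to 30B range mentioned in the abstract.
sizes = [125e6, 1.3e9, 13e9, 30e9]

# SYNTHETIC scores generated from an assumed log law (illustration only,
# NOT measurements from the paper).
true_a, true_b = 0.02, 0.005
scores = [true_a + true_b * math.log10(s) for s in sizes]

a, b = fit_log_scaling(sizes, scores)
print(f"intercept={a:.4f}, slope per decade={b:.4f}")
```

Under a fit of this form, each tenfold increase in size adds a constant `b` to the score, which is what "scales logarithmically" means here. The same function could be applied to training-set size instead of parameter count.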

Paper Structure (30 sections, 4 equations, 26 figures, 2 tables)

Figures (26)

  • Figure 1: Scaling laws of Semantic and Speech Audio Encoding Models - Figures 1a and 1b show logarithmic scaling of semantic encoding model performance with number of parameters and number of stories. Figure 1c shows average voxelwise $r^2$ for each layer of all tested models, averaged across 3 subjects. Figures 1d, 1e, and 1f show analogous results for speech audio models. Error bars for Figures 1b and 1e denote standard error across bootstraps. Error bars for Figures 1c and 1f denote SNR-normalized subject-axis standard error. $r^2$ is computed as $|r|*r$.
  • Figure 2: Large-scale encoding models - Performance of an encoding model built using OPT-30B on 20 hours of training data from a single subject. Surrounding plots show model predictions (red) against the average response (dashed black) over 10 separate trials (gray) on a held-out natural language test stimulus for selected voxels (Clockwise from bottom left: Well-predicted voxels from fusiform body area (FBA), Broca's area, precuneus, prefrontal cortex, and secondary auditory cortex.) Only voxels with $CC_{max} > 0.35$ are shown. (PFC = prefrontal cortex, PrCu = precuneus, AC = auditory cortex/Wernicke's area, AG = angular gyrus)
  • Figure 3: Noise Ceiling Analysis - Figure 3a: A two-channel flatmap showing which ROIs remain poorly explained by an encoding model built from the 33rd layer of OPT-30B. Voxels are less transparent if they have a higher idealized encoding performance ($CC_{max}$). Voxels are more yellow if they have high room for improvement, defined as the difference between the best possible encoding model and this model. Angular gyrus and some parts of prefrontal cortex are still poorly explained, while precuneus and higher auditory cortex are close to optimal. Figure 3b: A histogram of voxel correlations ($CC_{abs}$). Figure 3c: A histogram of normalized voxel correlations ($CC_{norm}$). (PFC = prefrontal cortex, PrCu = precuneus, AC = auditory cortex, AG = angular gyrus)
  • Figure 4: Stacked Regression - Figure 4a: A flatmap shows which regions of cortex improve when augmenting a semantic encoding model built from the 18th layer of LLaMA-33B with the layers of Whisper using stacked regression. A voxel uses the stacked regression only if the stacked regression performed better for that voxel on a validation set. The effect is highly localized to auditory cortex. Figure 4b: A butterfly plot comparing the voxelwise encoding performance of the stacked regression encoding model to the baseline semantic model. Figure 4c: The center-of-mass of the stacked regression attributions, $\mathcal{C}(\boldsymbol{\alpha}^{v,s})$, is visualized in auditory cortex. Figure 4d: The improvement in encoding performance of the stacked regression model over the baseline is visualized in auditory cortex.
  • Figure A.1: Parametric voxelwise scaling laws computed using the OPT language model family. Flatmaps show the constant of proportionality of encoding performance for model size scaling. Model size increases in semantic models seem to be most beneficial for predicting amodal, post-auditory cognitive areas such as prefrontal cortex.
  • ...and 21 more figures
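
The per-voxel quantities named in the captions above can be sketched in a few lines. The signed $r^2$ follows the Figure 1 caption ($|r|*r$), and the room for improvement follows the Figure 3 caption (best possible performance minus this model's performance). The ratio used for $CC_{norm}$ is an assumption based on a common noise-ceiling convention; the captions do not define it. All function names are hypothetical and the inputs are toy values, not data from the paper.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between a predicted and a measured time course."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def signed_r2(r):
    """Signed r^2 as stated in the Figure 1 caption: |r| * r.

    Unlike plain r**2, this stays negative for anti-correlated predictions.
    """
    return abs(r) * r


def room_for_improvement(cc_abs, cc_max):
    """Figure 3 caption: best possible performance minus this model's."""
    return cc_max - cc_abs


def cc_norm(cc_abs, cc_max):
    """ASSUMPTION: normalized correlation taken as CC_abs / CC_max.

    The captions name CC_norm but do not give its formula.
    """
    return cc_abs / cc_max


# Toy time courses (illustration only).
predicted = [1.0, 2.0, 3.0]
measured = [6.0, 4.0, 2.0]
r = pearson_r(predicted, measured)
print(signed_r2(r))
```

Keeping the sign in $r^2$ means a voxel whose predictions are anti-correlated with the response lowers an average rather than raising it.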