BrainWavLM: Fine-tuning Speech Representations with Brain Responses to Language

Nishitha Vattikonda, Aditya R. Vaidya, Richard J. Antonello, Alexander G. Huth

TL;DR

The paper addresses the limitation of linear brain encoding models by introducing BrainWavLM, a WavLM-based encoder fine-tuned end-to-end with LoRA on brain responses to natural speech. The method optimizes a brain-encoding objective with the loss ${\mathcal L}(\theta_g,\theta_p) = -\frac{1}{T} \sum_{t=1}^{T} {\rm corr}_v(R_{t,:}, \hat{R}_{t,:})$. Low-rank adapters (LoRA) update the attention projections $W^Q$, $W^K$, and $W^V$, and a bottleneck readout predicts fMRI signals from the model's features. Cortex-wide fine-tuning yields substantial encoding gains (≈12.5% on average) and robust cross-subject generalization, though performance in low-level auditory cortex (AC) can suffer unless AC is targeted for fine-tuning. Probing analyses reveal that brain-tuned models increasingly encode semantic representations (GloVe embeddings), with semantic gains comparable to those from explicit supervision, while AC-tuning preserves acoustic information. The work demonstrates that non-linear, brain-informed fine-tuning can produce robust, semantically enriched speech representations. It also suggests a pathway for training models with neural supervision rather than manual annotations, with LoRA enabling efficient, stable adaptation.
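
To make the loss in the TL;DR concrete, here is a minimal NumPy sketch. It assumes that $R$ and $\hat{R}$ are arrays of shape (T time points, V voxels) and that ${\rm corr}_v$ is the Pearson correlation taken across voxels for each time point, which is one reading of the notation. The function name and the `eps` stabilizer are illustrative and not taken from the paper.

```python
import numpy as np

def neg_mean_corr_loss(R, R_hat, eps=1e-8):
    """Negative mean Pearson correlation between rows of R and R_hat.

    R, R_hat: arrays of shape (T, V) holding measured and predicted responses.
    Assumption: the correlation is computed across voxels for each time point.
    """
    # Center each row (time point) across voxels.
    Rc = R - R.mean(axis=1, keepdims=True)
    Pc = R_hat - R_hat.mean(axis=1, keepdims=True)
    # Pearson correlation per row; eps guards against division by zero.
    num = (Rc * Pc).sum(axis=1)
    den = np.sqrt((Rc ** 2).sum(axis=1) * (Pc ** 2).sum(axis=1)) + eps
    # Average over time points and negate, so minimizing raises correlation.
    return -np.mean(num / den)
```

Under this reading the loss is bounded between -1 and 1, and it is unchanged by a positive rescaling or a constant shift of the predictions.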

Abstract

Speech encoding models use auditory representations to predict how the human brain responds to spoken language stimuli. Most performant encoding models linearly map the hidden states of artificial neural networks to brain data, but this linear restriction may limit their effectiveness. In this work, we use low-rank adaptation (LoRA) to fine-tune a WavLM-based encoding model end-to-end on a brain encoding objective, producing a model we name BrainWavLM. We show that fine-tuning across all of cortex improves average encoding performance with greater stability than without LoRA. This improvement comes at the expense of low-level regions like auditory cortex (AC), but selectively fine-tuning on these areas improves performance in AC, while largely retaining gains made in the rest of cortex. Fine-tuned models generalized across subjects, indicating that they learned robust brain-like representations of the speech stimuli. Finally, by training linear probes, we showed that the brain data strengthened semantic representations in the speech model without any explicit annotations. Our results demonstrate that brain fine-tuning produces best-in-class speech encoding models, and that non-linear methods have the potential to bridge the gap between artificial and biological representations of semantics.
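
The low-rank adaptation mentioned in the abstract can be sketched as follows. This shows the generic LoRA formulation (a frozen weight plus a scaled low-rank update, with one factor initialized to zero), not the authors' implementation. The class name, the rank, the scaling value, and the layer width are all illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen linear map W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, rank, alpha, rng):
        d_out, d_in = W.shape
        self.W = W                                        # frozen pre-trained weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01  # trainable, small random init
        self.B = np.zeros((d_out, rank))                  # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (n, d_in). The low-rank path adds scale * x @ A.T @ B.T.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def n_trainable(self):
        # Only A and B would be updated during fine-tuning.
        return self.A.size + self.B.size
```

Because `B` starts at zero, the adapted layer initially reproduces the pre-trained layer exactly, and only `rank * (d_in + d_out)` parameters are trained instead of `d_in * d_out`. Such a small, initially neutral update is one plausible reason for the greater training stability reported with LoRA.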

Paper Structure

This paper contains 18 sections, 2 equations, 8 figures.

Figures (8)

  • Figure 1: Encoding performance of BrainWavLM models fine-tuned on fMRI responses. (A) Cortical map of the change in test-set encoding performance from the pre-trained WavLM model to the highest-performing BrainWavLM model (selected by validation-set performance). Corresponds to the model fine-tuned with LoRA in (B). Results shown are for subject S03. (B) Encoding performance on the test set for the pre-trained model, the model fine-tuned on LLaMA features with LoRA, the model fine-tuned on fMRI data with LoRA, and the model fine-tuned on fMRI data without LoRA, averaged across voxels and then subjects. Error bars show the standard error of the mean (SEM) for the per-subject performance, corrected for the overall performance by subtracting each subject's mean performance across the models from each model's performance. (C) Percent change in validation encoding performance through model fine-tuning with LoRA. The biggest improvements are found in the first 10 epochs, after which the performance stabilizes. (D) Change in validation encoding performance for the models fine-tuned with and without LoRA on subject S03. Performance for the model fine-tuned with LoRA was more stable during training. Validation performance for additional subjects is shown in the appendix.
  • Figure 2: Models adapt to representations in subsets of cortex. (A) Encoding performance was computed for models fine-tuned on the whole cortex or fine-tuned just on auditory cortex (AC). Cortical maps show the difference in encoding performance on one subject. Only voxels with encoding performance above $0.15$ for the pre-trained model are shown. The model fine-tuned on AC has higher performance in language-selective areas in the temporal and frontal lobes (Fedorenko et al., 2011; Lipkin et al., 2022). (B) Average percent improvement in encoding performance from the pre-trained WavLM Base+ model to the fine-tuned BrainWavLM models, either computed within AC or across the rest of cortex. One model was fine-tuned to predict features from LLaMA, and three models were fine-tuned on fMRI responses from either the whole cortex (81K–95K voxels), only auditory cortex (1.3K–2.7K voxels), or the whole cortex except auditory cortex (79K–93K voxels). Error bars show the SEM for the per-subject performance. For both brain areas, fMRI-tuned models were better at predicting fMRI responses than the LLaMA-tuned model or the pre-trained WavLM model. The best model for AC was fine-tuned on AC, and the best model for the rest of cortex was fine-tuned on the rest of cortex. (C) Separate models were fine-tuned using only left- or right-hemisphere voxels. Cortical maps show the difference in performance. Voxels are filtered with the same condition as in (A). Similar maps for other subjects are shown in the appendix. (D) Percent improvement in encoding performance from the pre-trained model to the left- and right-hemisphere fine-tuned models, with models from (B) for comparison. Left-hemisphere fine-tuned models were marginally better across cortex than those fine-tuned on the right hemisphere, though both were substantially better than the pre-trained model.
  • Figure 3: Fine-tuned models transfer between fMRI subjects. Percent improvement in encoding performance (averaged across voxels) of the models fine-tuned using one subject's fMRI responses compared to the pre-trained model. (A) Models were fine-tuned using fMRI responses from the whole cortex. (B) Models were fine-tuned using fMRI responses from auditory cortex (AC). Performance was only measured on the voxels within auditory cortex. (C) Models were fine-tuned using fMRI responses from the whole cortex except AC. Performance was only measured on the voxels outside auditory cortex. In all cases we see substantial generalization between subjects, with improvements of 7--14% outside AC but only 1--6% inside AC.
  • Figure 4: Model representations change after fine-tuning. Probe improvement over the pre-trained model, averaged across subjects. Transformer layer 0 is WavLM's convolutional waveform encoder. Error bars indicate SEM across subjects. (A) Acoustic probes linearly predict filterbank features of the stimulus waveform. Models fine-tuned on the whole cortex or on LLaMA became less acoustic, whereas the middle layers of AC-tuned models became more acoustic. (B) Semantic probes linearly predict GloVe embeddings of the time-aligned transcript. GloVe probes had the same performance for fMRI-tuned and LLaMA-tuned models, suggesting that the fMRI data is an effective source of semantics. Un-averaged probe performance is shown in the appendix.
  • Figure 5: Percent Improvement in Encoding Performance for Models Fine-tuned with and without LoRA for subjects S01 and S02. Subject S03 is shown in Fig. 1D.
  • ...and 3 more figures
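
The corrected error bars described in the Figure 1(B) caption can be sketched as below. This is a minimal illustration, assuming a matrix of per-subject, per-model scores and the sample standard deviation (`ddof=1`). The function name and the example numbers are hypothetical and are not results from the paper.

```python
import numpy as np

def within_subject_sem(perf):
    """SEM per model after removing each subject's overall performance level.

    perf: array of shape (n_subjects, n_models) of encoding performance.
    """
    # Subtract each subject's mean across models from that subject's scores.
    centered = perf - perf.mean(axis=1, keepdims=True)
    # Assumption: sample standard deviation across subjects (ddof=1).
    return centered.std(axis=0, ddof=1) / np.sqrt(perf.shape[0])

# Hypothetical scores: three subjects with different baselines, two models.
perf = np.array([[0.10, 0.12],
                 [0.20, 0.22],
                 [0.30, 0.32]])
corrected = within_subject_sem(perf)
uncorrected = perf.std(axis=0, ddof=1) / np.sqrt(perf.shape[0])
```

In this toy case every subject shows the same difference between the two models, so the corrected SEM is essentially zero while the uncorrected SEM is dominated by between-subject baseline differences. This is why the correction makes between-model comparisons easier to read.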