BrainWavLM: Fine-tuning Speech Representations with Brain Responses to Language
Nishitha Vattikonda, Aditya R. Vaidya, Richard J. Antonello, Alexander G. Huth
TL;DR
The paper addresses the limitation of linear brain encoding models by introducing BrainWavLM, an end-to-end, LoRA-fine-tuned WavLM-based encoder trained on brain responses to natural speech. The method optimizes a brain-encoding objective, with the loss defined as $\mathcal{L}(\theta_g,\theta_p) = -\frac{1}{T} \sum_{t=1}^{T} \mathrm{corr}_v(R_{t,:}, \hat{R}_{t,:})$, while leveraging a low-rank adapter (LoRA) to update $W^Q$, $W^K$, and $W^V$ and a bottleneck readout to predict fMRI signals from the neural features. Key findings show cortex-wide fine-tuning yields substantial encoding gains (≈12.5% on average) and robust cross-subject generalization, though low-level auditory cortex (AC) may initially suffer unless AC is targeted for fine-tuning. Probing analyses reveal that brain-tuned models increasingly encode semantic representations (GloVe embeddings) with comparable semantic gains to explicit supervision, while AC-tuning preserves acoustic information. The work demonstrates that non-linear, brain-informed fine-tuning can produce robust, semantically enriched speech representations and suggests a pathway for training models with neural supervision without manual annotations, with LoRA enabling efficient, stable adaptation.
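As a rough illustration of the loss quoted above, the following NumPy sketch computes a negative mean Pearson correlation. It reads the formula literally: one correlation per time point $t$, taken across the voxel axis of $R_{t,:}$ and $\hat{R}_{t,:}$. The function name `neg_mean_correlation`, the `eps` stabilizer, and the interpretation of $R$ and $\hat{R}$ as measured and predicted responses of shape (T, V) are assumptions made for this example, not details confirmed by the summary.

```python
import numpy as np

def neg_mean_correlation(R, R_hat, eps=1e-8):
    """Negative mean Pearson correlation, one correlation per row.

    R, R_hat: arrays of shape (T, V), assumed here to hold measured and
    predicted responses (T time points, V voxels). `eps` is a hypothetical
    stabilizer that guards against division by zero.
    """
    R = np.asarray(R, dtype=float)
    R_hat = np.asarray(R_hat, dtype=float)
    # Center each row across the voxel axis.
    Rc = R - R.mean(axis=1, keepdims=True)
    Pc = R_hat - R_hat.mean(axis=1, keepdims=True)
    # Row-wise covariance divided by the product of row-wise norms.
    num = (Rc * Pc).sum(axis=1)
    den = np.sqrt((Rc ** 2).sum(axis=1) * (Pc ** 2).sum(axis=1)) + eps
    # Average over the T rows and negate so that minimizing the loss
    # maximizes correlation.
    return -np.mean(num / den)
```

Under this reading, a perfect prediction gives a loss close to -1 and a sign-flipped prediction gives a loss close to +1. Changing `axis=1` to `axis=0` would give the per-voxel variant, with correlations taken over time instead.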
Abstract
Speech encoding models use auditory representations to predict how the human brain responds to spoken language stimuli. Most performant encoding models linearly map the hidden states of artificial neural networks to brain data, but this linear restriction may limit their effectiveness. In this work, we use low-rank adaptation (LoRA) to fine-tune a WavLM-based encoding model end-to-end on a brain encoding objective, producing a model we name BrainWavLM. We show that fine-tuning across all of cortex improves average encoding performance with greater stability than without LoRA. This improvement comes at the expense of low-level regions like auditory cortex (AC), but selectively fine-tuning on these areas improves performance in AC, while largely retaining gains made in the rest of cortex. Fine-tuned models generalized across subjects, indicating that they learned robust brain-like representations of the speech stimuli. Finally, by training linear probes, we showed that the brain data strengthened semantic representations in the speech model without any explicit annotations. Our results demonstrate that brain fine-tuning produces best-in-class speech encoding models, and that non-linear methods have the potential to bridge the gap between artificial and biological representations of semantics.
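For readers unfamiliar with low-rank adaptation, here is a minimal NumPy sketch of the general LoRA idea applied to a single projection matrix: the pretrained weight stays frozen and only a low-rank correction is trained. The dimensions, the scaling factor `alpha`, the initialization, and the helper name `lora_forward` are illustrative assumptions, not the configuration used for BrainWavLM.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 8.0  # hypothetical sizes and scaling, not from the paper

W = rng.standard_normal((d, d))         # frozen pretrained projection (e.g. a query matrix)
A = rng.standard_normal((r, d)) * 0.01  # trainable down-projection, small random init
B = np.zeros((d, r))                    # trainable up-projection, zero init

def lora_forward(x, W, A, B, alpha, r):
    """Frozen path plus scaled low-rank update: x W^T + (alpha / r) x A^T B^T."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((3, d))   # a batch of 3 hypothetical input vectors
y = lora_forward(x, W, A, B, alpha, r)
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and training only has to learn the 2·d·r adapter parameters rather than all d·d entries of `W`.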
