Improving Semantic Understanding in Speech Language Models via Brain-tuning
Omer Moussa, Dietrich Klakow, Mariya Toneva
TL;DR
This work introduces brain-tuning, a brain-informed fine-tuning approach that uses fMRI responses collected while participants listen to natural stories to inject brain-relevant semantics into pretrained speech models. By fine-tuning encoder components on audio-fMRI pairs, the authors demonstrate improved alignment with semantic brain regions and reduced dependence on low-level speech features across three model families (Wav2vec2.0, HuBERT, Whisper), along with consistent gains on downstream semantic tasks. The method uses a naturalistic dataset with voxel-wise noise ceilings to guide voxel selection. It is evaluated on brain alignment, feature ablation, and downstream performance, which together provide converging evidence that brain-guided training enhances semantic understanding beyond traditional fine-tuning. These findings suggest that incorporating brain signals during training can yield more brain-like representations and better semantic capabilities, potentially advancing AI systems as model organisms for auditory language processing. The work also opens avenues for scaling brain-tuning to larger models and broader brain datasets, and the authors plan to release code and models publicly.
Abstract
Speech language models align with human brain responses to natural language to an impressive degree. However, current models rely heavily on low-level speech features, indicating that they lack brain-relevant semantics, which limits their utility as model organisms of semantic processing in the brain. In this work, we address this limitation by inducing brain-relevant bias directly into the models via fine-tuning with fMRI recordings of people listening to natural stories, a process we name brain-tuning. After testing it on three different pretrained model families, we show that brain-tuning not only improves overall alignment with new brain recordings in semantic language regions, but also reduces the reliance on low-level speech features for this alignment. Excitingly, we further show that brain-tuning leads to 1) consistent improvements in performance on a range of downstream tasks and 2) a representational space with increased semantic preference. Our results provide converging evidence, for the first time, that incorporating brain signals into the training of language models improves the models' semantic understanding.
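
The TL;DR mentions voxel selection guided by voxel-wise noise ceilings, and the abstract refers to alignment with brain recordings. The following is a minimal sketch of how those two steps might look, not the authors' implementation. It assumes that voxels are kept when their noise ceiling exceeds a cutoff and that alignment is scored as a per-voxel Pearson correlation between predicted and measured responses. The function names, the 0.4 cutoff, and the toy data are all hypothetical.

```python
import math


def select_voxels(noise_ceilings, threshold=0.4):
    """Return indices of voxels whose noise ceiling exceeds the threshold.

    The threshold value is an illustrative assumption, not taken from the text.
    """
    return [i for i, nc in enumerate(noise_ceilings) if nc > threshold]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def voxelwise_alignment(predicted, measured, voxel_ids):
    """Score each selected voxel by correlating predicted and measured time courses.

    predicted, measured: lists of time points, each a list of per-voxel values.
    """
    scores = {}
    for v in voxel_ids:
        pred_v = [row[v] for row in predicted]
        meas_v = [row[v] for row in measured]
        scores[v] = pearson(pred_v, meas_v)
    return scores


# Toy data: 4 time points, 3 voxels (hypothetical values).
noise_ceilings = [0.1, 0.6, 0.8]
measured = [[0.0, 1.0, 2.0], [1.0, 2.0, 1.0], [0.0, 3.0, 0.0], [2.0, 4.0, -1.0]]
predicted = [[0.5, 2.0, -2.0], [0.1, 4.0, -1.0], [0.3, 6.0, 0.0], [0.2, 8.0, 1.0]]

selected = select_voxels(noise_ceilings)  # voxel 0 falls below the cutoff
scores = voxelwise_alignment(predicted, measured, selected)
print(selected, scores)
```

In this toy example the low-ceiling voxel is excluded before scoring, which illustrates why a noise-ceiling filter can keep unreliable voxels from dominating either the training signal or the alignment estimate.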
