Improving Semantic Understanding in Speech Language Models via Brain-tuning

Omer Moussa, Dietrich Klakow, Mariya Toneva

TL;DR

This work introduces brain-tuning, a brain-informed fine-tuning approach that uses fMRI responses collected while participants listen to natural stories to inject brain-relevant semantics into pretrained speech models. By fine-tuning encoder components on audio-fMRI pairs, the authors demonstrate improved alignment with semantic brain regions and reduced dependence on low-level speech features across three model families (Wav2vec2.0, HuBERT, Whisper), plus consistent gains on downstream semantic tasks. The method uses a naturalistic dataset with voxel-wise noise ceilings to guide voxel selection, and it is evaluated on three fronts: brain alignment, low-level feature ablation, and downstream performance. Together these provide converging evidence that brain-guided training improves semantic understanding relative to the pretrained and fine-tuned baselines. The findings suggest that incorporating brain signals during training can yield more brain-like representations and better semantic capabilities, which would make such models more useful as model organisms for auditory language processing. The work also opens avenues for scaling brain-tuning to larger models and broader brain datasets, and the authors plan to release code and models publicly.
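The voxel selection and alignment steps mentioned in this summary could look roughly like the sketch below. It is illustrative only: the function names, the `threshold=0.4` default, and the choice to normalise by dividing the Pearson correlation by the noise ceiling are assumptions made for this example, not details confirmed by the summary.

```python
import numpy as np


def pearson_per_voxel(pred, true):
    """Column-wise Pearson correlation between two (time, voxels) arrays."""
    pred_c = pred - pred.mean(axis=0)
    true_c = true - true.mean(axis=0)
    denom = np.linalg.norm(pred_c, axis=0) * np.linalg.norm(true_c, axis=0)
    # Constant columns have zero norm; map them to a correlation of 0.
    safe_denom = np.where(denom == 0, np.inf, denom)
    return (pred_c * true_c).sum(axis=0) / safe_denom


def select_voxels(noise_ceiling, threshold=0.4):
    """Boolean mask of voxels whose noise ceiling exceeds the threshold.

    The threshold value is a hypothetical default, not taken from the source.
    """
    return np.asarray(noise_ceiling) > threshold


def normalised_alignment(pred, true, noise_ceiling, threshold=0.4):
    """Alignment of selected voxels, scaled by their noise ceiling (assumed convention)."""
    noise_ceiling = np.asarray(noise_ceiling, dtype=float)
    mask = select_voxels(noise_ceiling, threshold)
    r = pearson_per_voxel(pred[:, mask], true[:, mask])
    return r / noise_ceiling[mask]
```

Here `pred` would hold fMRI responses predicted from model representations and `true` the recorded responses for held-out stories. Restricting to high-noise-ceiling voxels keeps the evaluation focused on voxels whose signal is reliable enough to be predicted at all.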

Abstract

Speech language models align with human brain responses to natural language to an impressive degree. However, current models rely heavily on low-level speech features, indicating they lack brain-relevant semantics, which limits their utility as model organisms of semantic processing in the brain. In this work, we address this limitation by inducing brain-relevant bias directly into the models via fine-tuning with fMRI recordings of people listening to natural stories, a process we name brain-tuning. After testing it on three different pretrained model families, we show that brain-tuning not only improves overall alignment with new brain recordings in semantic language regions, but also reduces the reliance on low-level speech features for this alignment. Excitingly, we further show that brain-tuning leads to 1) consistent improvements in performance on a range of downstream tasks and 2) a representational space with increased semantic preference. Our results provide converging evidence, for the first time, that incorporating brain signals into the training of language models improves the models' semantic understanding.
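To make "reliance on low-level speech features" concrete, the sketch below expresses it as a percentage drop in brain alignment once low-level features are removed from the model representations. The removal step shown here (regressing the low-level features out with ordinary least squares) is an assumption chosen for illustration, and the function names are hypothetical; the abstract does not specify the ablation procedure.

```python
import numpy as np


def residualise(model_feats, low_level_feats):
    """Remove the linear contribution of low-level features from model features.

    Both inputs are (time, dims) arrays. This OLS-based removal is an assumed
    stand-in for whatever ablation the authors actually use.
    """
    # Append an intercept column so the mean is regressed out as well.
    design = np.column_stack([low_level_feats, np.ones(len(low_level_feats))])
    coef, *_ = np.linalg.lstsq(design, model_feats, rcond=None)
    return model_feats - design @ coef


def low_level_impact(align_full, align_ablated):
    """Percentage drop in mean brain alignment after removing low-level features."""
    full = float(np.mean(align_full))
    ablated = float(np.mean(align_ablated))
    return 100.0 * (full - ablated) / full
```

Under this definition, a lower value means the alignment survives the removal of low-level features, which is the outcome the abstract reports for brain-tuned models in semantic language regions.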

Paper Structure

This paper contains 41 sections, 2 equations, 17 figures.

Figures (17)

  • Figure 1: Training and Evaluation Approaches. (a) Brain-tuning approach for a given speech model; (b) Evaluation of brain alignment and low-level feature impact on the brain alignment; (c) Types of evaluation and expected outcomes if brain-tuning successfully improves semantic understanding in speech models: increase of alignment with semantic brain regions, decrease of impact of low-level features on this alignment, and increase in downstream performance on semantic tasks.
  • Figure 2: (a), (b) Mean normalised brain alignment for different brain areas. Error bars indicate the standard error across participants, with * indicating significantly different alignment from pretrained. Brain-tuning significantly improves alignment with late language regions for the self-supervised models. (c) Voxel-wise differences in brain alignment between brain-tuned and pretrained Wav2vec2.0 for a representative participant. Higher alignment is observed in semantic areas.
  • Figure 3: (a), (b) Mean impact of low-level speech features (percentage drop in brain alignment) for different regions. Error bars indicate the standard error of the mean across participants, and * denotes significantly lower low-level impact than in the pretrained model. All models have significantly lower low-level impact in late language regions. (c) Voxel-wise differences in low-level impact between brain-tuned and pretrained Wav2vec2.0 for a representative participant.
  • Figure 4: Downstream task performance for different models. Brain-tuned models' performance is the mean and STE across participants. Brain-tuned models show consistent improvement over the baselines, with the biggest gains in more semantic tasks.
  • Figure 5: Brain alignment and low-level impact comparison with the pretrained HuBERT large architecture on 3 subjects. The HuBERT base architecture performs comparably to the pretrained large architecture and is less affected by the removal of low-level features.
  • ...and 12 more figures