Refining Self-Supervised Learnt Speech Representation using Brain Activations
Hengyu Li, Kangdi Mei, Zhaoci Liu, Yang Ai, Liping Chen, Jie Zhang, Zhenhua Ling
TL;DR
This work asks whether self-supervised speech models can be further improved by leveraging neural activations recorded during human speech processing. It adds a neural-encoding head on top of wav2vec2.0 to predict fMRI BOLD responses from speech, using a two-stage training regime with an $L_2$-regularized MSE loss and an input history of $n=6$ steps (about $9\,\text{s}$ of audio). Evaluated on SUPERB, the refined model improves on multiple downstream tasks (e.g., PR, IC, SID, ASR, ASV), while some tasks remain unchanged or degrade under domain shift between the fMRI stimuli and the pretraining data. The results demonstrate a feasible neuroscience-driven path to optimizing SSL speech models and motivate future work that combines diverse brain signals and targets low-SNR scenarios to further enhance performance.
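As an illustration of the objective summarized above, the following is a minimal sketch of an $L_2$-regularized MSE loss computed over inputs that stack a history of $n=6$ steps. It is not the authors' implementation: the linear form of the head, the concatenation of history frames, the zero-padding at the start, and the names `stack_history`, `predict`, and `l2_regularized_mse` are all assumptions made for this example. The step length of $1.5\,\text{s}$ is inferred only from "6 steps, about 9 s".

```python
from typing import List

N_HISTORY = 6      # history length n from the TL;DR
STEP_SECONDS = 1.5  # assumption: inferred from 6 steps covering about 9 s

Vector = List[float]


def stack_history(features: List[Vector], n: int = N_HISTORY) -> List[Vector]:
    """Concatenate the feature vectors of steps t-n+1..t for every step t.

    Steps before the start of the recording are zero-padded (an assumption).
    """
    dim = len(features[0])
    zero = [0.0] * dim
    stacked = []
    for t in range(len(features)):
        window: Vector = []
        for k in range(t - n + 1, t + 1):
            window.extend(features[k] if k >= 0 else zero)
        stacked.append(window)
    return stacked


def predict(x: Vector, weights: List[Vector]) -> Vector:
    """Hypothetical linear encoding head: one weight row per voxel."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def l2_regularized_mse(inputs: List[Vector], targets: List[Vector],
                       weights: List[Vector], lam: float) -> float:
    """Mean squared error over all voxels and steps plus lam * ||W||^2."""
    squared_error = 0.0
    count = 0
    for x, y in zip(inputs, targets):
        pred = predict(x, weights)
        squared_error += sum((p - yi) ** 2 for p, yi in zip(pred, y))
        count += len(y)
    penalty = lam * sum(w * w for row in weights for w in row)
    return squared_error / count + penalty


if __name__ == "__main__":
    # Toy data: 8 steps of 2-dimensional features and 3 voxels.
    feats = [[0.1 * t, 1.0] for t in range(8)]
    bold = [[0.0, 0.5, 1.0] for _ in range(8)]
    head = [[0.01] * (2 * N_HISTORY) for _ in range(3)]
    print(l2_regularized_mse(stack_history(feats), bold, head, lam=1e-3))
```

In this sketch each input row has `n * dim` entries, so the head sees the whole history window when predicting the BOLD response at a given step; the regularization weight `lam` is a free hyperparameter here.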
Abstract
It has been shown in the literature that speech representations extracted by self-supervised pre-trained models exhibit similarities with human brain activations during speech perception, and that fine-tuning speech representation models on downstream tasks can further improve this similarity. However, it remains unclear whether this similarity can be used to optimize the pre-trained speech models. In this work, we therefore propose to use brain activations recorded by fMRI to refine the widely used wav2vec2.0 model by aligning model representations toward human neural responses. Experimental results on SUPERB reveal that this operation is beneficial for several downstream tasks, e.g., speaker verification, automatic speech recognition, and intent classification. The proposed method can thus be considered a new alternative for improving self-supervised speech models.
