Refining Self-Supervised Learnt Speech Representation using Brain Activations

Hengyu Li, Kangdi Mei, Zhaoci Liu, Yang Ai, Liping Chen, Jie Zhang, Zhenhua Ling

TL;DR

This work asks whether self-supervised speech models can be further improved by leveraging human brain activations recorded during speech processing. It adds a neural-encoding head on top of wav2vec2.0 that predicts fMRI BOLD responses from speech, trained in two stages with an $L_2$-regularized MSE loss and taking an input history of $n=6$ (about $9\,\text{s}$ of audio). Evaluated on SUPERB, the refined model improves on multiple downstream tasks (e.g., PR, IC, SID, ASR, ASV), while some tasks remain unchanged or degrade under domain shift (stimuli-pretrain). The results suggest a feasible neuroscience-driven path to optimizing SSL speech models and motivate future work that combines diverse brain signals and addresses low-SNR scenarios.

Abstract

It has been shown in the literature that speech representations extracted by self-supervised pre-trained models exhibit similarities with human brain activations during speech perception, and that fine-tuning speech representation models on downstream tasks can further increase this similarity. However, it remains unclear whether this similarity can be used to optimize the pre-trained speech models. In this work, we therefore propose to use brain activations recorded by fMRI to refine the widely used wav2vec2.0 model by aligning model representations toward human neural responses. Experimental results on SUPERB reveal that this operation is beneficial for several downstream tasks, e.g., speaker verification, automatic speech recognition, and intent classification. The proposed method can thus be considered a new alternative for improving self-supervised speech models.
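The TL;DR mentions a neural-encoding head trained with an $L_2$-regularized MSE loss to predict BOLD responses. The sketch below illustrates that kind of objective only, under loud assumptions: the head is a single linear map, `features` stands in for pooled wav2vec2.0 representations, `bold` stands in for recorded voxel responses, and all function names, shapes, and hyperparameters are hypothetical. It is not the authors' implementation and does not reproduce the two-stage regime or any update of the wav2vec2.0 parameters.

```python
import numpy as np


def ridge_mse_loss(features, bold, weights, bias, l2=1e-2):
    """MSE between predicted and target responses plus an L2 weight penalty.

    features: (samples, feat_dim) stand-in for speech representations.
    bold:     (samples, voxels)   stand-in for fMRI BOLD targets.
    """
    pred = features @ weights + bias
    mse = np.mean((pred - bold) ** 2)
    penalty = l2 * np.sum(weights ** 2)
    return mse + penalty


def ridge_mse_grads(features, bold, weights, bias, l2=1e-2):
    """Analytic gradients of ridge_mse_loss w.r.t. weights and bias."""
    err = features @ weights + bias - bold
    scale = 2.0 / err.size  # derivative of the mean over all entries
    grad_w = scale * (features.T @ err) + 2.0 * l2 * weights
    grad_b = scale * err.sum(axis=0)
    return grad_w, grad_b


def fit_encoding_head(features, bold, l2=1e-2, lr=0.1, steps=500):
    """Fit the linear head with plain gradient descent (illustrative only)."""
    weights = np.zeros((features.shape[1], bold.shape[1]))
    bias = np.zeros(bold.shape[1])
    for _ in range(steps):
        grad_w, grad_b = ridge_mse_grads(features, bold, weights, bias, l2)
        weights -= lr * grad_w
        bias -= lr * grad_b
    return weights, bias


if __name__ == "__main__":
    # Synthetic data in place of real speech features and fMRI recordings.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((200, 8))
    y = x @ rng.standard_normal((8, 5)) + 0.1 * rng.standard_normal((200, 5))
    w, b = fit_encoding_head(x, y)
    print("final loss:", ridge_mse_loss(x, y, w, b))
```

In a refinement setting like the one described, the same loss would presumably also be backpropagated into the speech model itself rather than only into the head; that step is omitted here because its details are not given in this summary.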

Paper Structure (13 sections, 1 equation, 4 figures, 1 table)


Figures (4)

  • Figure 1: Flowchart of refining wav2vec2.0 using brain signals.
  • Figure 2: The percentages of parameter changes after refining different parameter types at different model layers.
  • Figure 3: Analytical results of (a) the length of input history waveforms and (b) predicting brain activations.
  • Figure 4: Analytical results of layer weights.