Teaching Wav2Vec2 the Language of the Brain

Tobias Fiedler, Leon Hermann, Florian Müller, Sarel Cohen, Peter Chin, Tobias Friedrich, Eilon Vaadia

TL;DR

This work addresses decoding speech from brain activity when only small BCI datasets are available, by transferring audio-based Wav2Vec2 representations to brain data. It replaces Wav2Vec2's audio feature extractor with a Brain Feature Extractor (BFE) and evaluates three training setups across 45 BFE architectures, finding that Full Fine-Tuning with pre-trained Wav2Vec2 yields the best results (CER 18.54%, WER 30.97% without an LM). Latent analysis shows the transformer can partially align brain-derived representations with audio representations despite cross-domain distribution differences. The study demonstrates a feasible cross-domain transfer that can leverage abundant audio-model knowledge to enhance brain decoding for speech, with code available for replication.

Abstract

Decoding continuously spoken speech from neuronal activity has the potential to become an important clinical solution for paralyzed patients. Deep learning Brain-Computer Interfaces (BCIs) have recently succeeded in mapping neuronal activity to text in subjects who attempted to formulate speech. However, only small BCI datasets are available. In contrast, labeled data and pre-trained models for the closely related task of speech recognition from audio are widely available. One such model is Wav2Vec2, which has been trained in a self-supervised fashion to create meaningful representations of speech audio data. In this study, we show that patterns learned by Wav2Vec2 are transferable to brain data. Specifically, we replace its audio feature extractor with an untrained Brain Feature Extractor (BFE) model. We then run three training setups, each for 45 different BFE architectures: full fine-tuning with pre-trained Wav2Vec2 weights, training "from scratch" without pre-trained weights, and training only the BFE with a frozen pre-trained Wav2Vec2. Across these experiments, the best run comes from full fine-tuning with pre-trained weights, achieving a Character Error Rate (CER) of 18.54%, outperforming the best training-from-scratch run by 20.46 percentage points and the best frozen Wav2Vec2 run by 15.92 percentage points. These results indicate that knowledge transfer from audio speech recognition to brain decoding is possible and significantly improves brain decoding performance for the same architectures. Related source code is available at https://github.com/tfiedlerdev/Wav2Vec2ForBrain.
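
To make the reported metrics concrete, the following is a minimal sketch of how CER and WER are commonly computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, at character or word level. The function names and the example sentences are illustrative assumptions, and the authors' evaluation code may apply text normalization that is not reproduced here. The last lines derive the baseline CERs implied by the abstract, assuming both reported gaps are percentage points; these derived values are not figures stated in the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits per reference character."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits per reference word."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical example sentence, not taken from the dataset.
example_cer = cer("the brain speaks", "the brain speak")  # 1 edit / 16 chars
example_wer = wer("the brain speaks", "the brain speak")  # 1 edit / 3 words

# Derived under the assumption that both gaps are percentage points.
best_full_finetuning_cer = 18.54
implied_scratch_cer = round(best_full_finetuning_cer + 20.46, 2)
implied_frozen_cer = round(best_full_finetuning_cer + 15.92, 2)
```

A single wrong character costs far more in WER than in CER here (one of three words versus one of sixteen characters), which is consistent with the WER in the TL;DR being higher than the CER.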

Paper Structure (10 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: "Wav2Vec2 for CTC loss" architecture adapted to brain data. Here, the Wav2Vec2 feature extractor is replaced by our BFE, which is a GRU model followed by a fully connected projection network.
  • Figure 2: Conceptual depiction of experiment setups. Setup 1 and 2 load pre-trained weights for the Wav2Vec2 model, while Setup 3 loads random weights. Setup 2 freezes the Wav2Vec2 module parameters. In all setups, training runs with the same 45 BFE architectures are executed.
  • Figure 3: Comparison of the CER and WER achieved on the test set by the runs of the three experiment setups: (1) Full Fine-Tuning, (2) Frozen Wav2Vec2 training, and (3) Training from Scratch.
  • Figure 4: Features extracted by the BFE (blue) and the pre-trained Wav2Vec2 Audio Feature Extractor (red). Both are shown before and after being passed through the same pre-trained Wav2Vec2 transformer module, reduced in dimension via t-SNE.
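
The three experiment setups described in Figures 2 and 3 can be summarized as a small configuration sketch. This is an illustration only: the `Setup` class, its field names, and the module labels `"bfe"` and `"wav2vec2"` are hypothetical and not taken from the authors' repository. The total run count assumes exactly one training run per BFE architecture per setup.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Setup:
    name: str
    pretrained_wav2vec2: bool  # load pre-trained weights (else random init)
    freeze_wav2vec2: bool      # keep Wav2Vec2 parameters fixed in training


# Numbering follows the caption of Figure 3.
SETUPS = [
    Setup("Full Fine-Tuning", pretrained_wav2vec2=True, freeze_wav2vec2=False),
    Setup("Frozen Wav2Vec2 training", pretrained_wav2vec2=True, freeze_wav2vec2=True),
    Setup("Training from Scratch", pretrained_wav2vec2=False, freeze_wav2vec2=False),
]


def trainable_modules(setup: Setup) -> List[str]:
    """Return the modules whose parameters receive gradient updates."""
    modules = ["bfe"]  # the BFE starts untrained and is trained in every setup
    if not setup.freeze_wav2vec2:
        modules.append("wav2vec2")
    return modules


NUM_BFE_ARCHITECTURES = 45
# Assumption: one run per (setup, architecture) pair.
total_runs = len(SETUPS) * NUM_BFE_ARCHITECTURES

for number, setup in enumerate(SETUPS, start=1):
    init = "pre-trained" if setup.pretrained_wav2vec2 else "random"
    print(f"Setup {number}: {setup.name} | Wav2Vec2 init: {init} | "
          f"trained: {', '.join(trainable_modules(setup))}")
```

Only Setup 2 restricts training to the BFE, so comparing it with Setup 1 isolates the effect of adapting the Wav2Vec2 weights, while comparing Setup 1 with Setup 3 isolates the effect of audio pre-training.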