Self-Supervised Learning for Multi-Channel Neural Transducer

Atsushi Kojima

TL;DR

This work extends wav2vec 2.0-style self-supervised learning to a multi-channel end-to-end ASR model, specifically a multi-channel neural transducer, and compares three quantization strategies for pre-training: joint, feature-wise, and channel-wise quantization. Feature-wise quantization is the most effective, delivering a 66% relative CER reduction over a model without pre-training on a far-field in-house dataset and up to 4.2% CER / 2.4% WER reductions on CHiME-4. Channel-wise and joint quantization are less effective, and cross-channel attention combined with feature-wise quantization improves robustness to noise and reverberation. Overall, the results support the feasibility and practicality of wav2vec 2.0-style self-supervised pre-training for multi-channel end-to-end ASR and provide guidance on quantization design for such models.

Abstract

Self-supervised learning, such as with the wav2vec 2.0 framework, significantly improves the accuracy of end-to-end automatic speech recognition (ASR). Wav2vec 2.0 has been applied to single-channel end-to-end ASR models. In this work, we explored a self-supervised learning method for a multi-channel end-to-end ASR model based on the wav2vec 2.0 framework. As the multi-channel end-to-end ASR model, we focused on a multi-channel neural transducer. In pre-training, we compared three different methods for feature quantization to train a multi-channel conformer audio encoder: joint quantization, feature-wise quantization, and channel-wise quantization. In fine-tuning, we trained the multi-channel conformer-transducer. All experiments were conducted using the far-field in-house and CHiME-4 datasets. The results of the experiments showed that feature-wise quantization was the most effective among the methods. We observed a 66% relative reduction in character error rate compared with the model without any pre-training for the far-field in-house dataset.
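
The "66% relative reduction" above is a ratio of error rates, not a difference in percentage points. As a minimal sketch of that arithmetic, the helper below computes a relative reduction from a baseline and an improved error rate. The function name and the example values (30.0 and 10.2) are hypothetical, chosen only so the ratio works out to 66%; they are not figures from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction (baseline - improved) / baseline.

    Both arguments are error rates in the same unit (e.g. CER in percent).
    """
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical error rates, NOT taken from the paper.
    baseline_cer = 30.0   # model without pre-training (assumed value)
    improved_cer = 10.2   # model with pre-training (assumed value)
    reduction = relative_reduction(baseline_cer, improved_cer)
    print(f"relative CER reduction: {reduction:.0%}")
```

With these assumed values the absolute drop is 19.8 points, while the relative reduction is 19.8 / 30.0, which is how a figure like 66% should be read.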

Paper Structure (16 sections, 4 equations, 5 figures, 4 tables)

Introduction
Background
Multi-channel neural transducer
Self-supervised learning based on wav2vec 2.0 framework
Self-supervised learning for multi-channel neural transducer
Joint quantization
Feature-wise quantization
Channel-wise quantization
Experiments
Data preparation
Model details
Results
Far-field in-house dataset
CHiME-4 dataset
Analysis of hidden vectors
...and 1 more section

Figures (5)

Figure 1: Architecture of multi-channel neural transducer in the case of two channels.
Figure 2: Joint quantization in the case of two channels.
Figure 3: Feature-wise quantization.
Figure 4: Channel-wise quantization.
Figure 5: Analysis of hidden vectors after self-supervised learning.
