Optimized Self-supervised Training with BEST-RQ for Speech Recognition

Ilja Baumann, Dominik Wagner, Korbinian Riedhammer, Tobias Bocklet

TL;DR

This work further optimizes the BEST-RQ approach by adding Kullback-Leibler divergence as a regularizing loss and by extending it to multiple codebooks per cluster, where the clusters are derived from low-level feature clustering. These changes lead to faster convergence in pre-training and fine-tuning and stabilize the pre-training.

Abstract

Self-supervised learning has been successfully used for various speech-related tasks, including automatic speech recognition. BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) has achieved state-of-the-art results in speech recognition. In this work, we further optimize the BEST-RQ approach using Kullback-Leibler divergence as an additional regularizing loss and a multi-codebook extension per cluster derived from low-level feature clustering. Preliminary experiments on the train-100 split of LibriSpeech show a relative improvement of 11.2% on test-clean from using multiple codebooks; combining cross-entropy with Kullback-Leibler divergence further reduces the word error rate by 4.5%. With the proposed optimizations, pre-training and fine-tuning on full LibriSpeech yield relative word error rate improvements of up to 23.8% on test-clean and 30.6% on test-other using 6 codebooks. Furthermore, the proposed setup leads to faster convergence in pre-training and fine-tuning and additionally stabilizes the pre-training.
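
To make the two ingredients named in the abstract more concrete, the following is a minimal, dependency-free sketch of a random-projection quantizer with several codebooks and a loss that combines cross-entropy with a Kullback-Leibler term. It is not the authors' implementation. The function names, the dimensions, the `kl_weight` and `temperature` parameters, and in particular the choice of a soft target built from codebook distances are assumptions made for illustration; the abstract does not specify how the KL target is formed, and the per-cluster codebook assignment is omitted here.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def make_quantizer(feat_dim, code_dim, vocab_size, rng):
    """Create a frozen random projection and a frozen random codebook."""
    proj = [[rng.gauss(0, 1) for _ in range(code_dim)] for _ in range(feat_dim)]
    codebook = [
        l2_normalize([rng.gauss(0, 1) for _ in range(code_dim)])
        for _ in range(vocab_size)
    ]
    return proj, codebook


def quantize(frame, proj, codebook):
    """Project a frame and return (nearest code index, squared distances)."""
    code_dim = len(proj[0])
    projected = l2_normalize(
        [sum(frame[i] * proj[i][j] for i in range(len(frame))) for j in range(code_dim)]
    )
    dists = [sum((p - c) ** 2 for p, c in zip(projected, code)) for code in codebook]
    return min(range(len(dists)), key=dists.__getitem__), dists


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def combined_loss(frame, logits_per_codebook, quantizers, kl_weight=0.5, temperature=1.0):
    """Average of CE + kl_weight * KL over all codebooks for one masked frame.

    ASSUMPTION: the KL target is a softmax over negative codebook distances.
    This is an illustrative choice, not taken from the paper.
    """
    total = 0.0
    for logits, (proj, codebook) in zip(logits_per_codebook, quantizers):
        label, dists = quantize(frame, proj, codebook)
        pred = softmax(logits)
        soft_target = softmax([-d / temperature for d in dists])
        cross_entropy = -math.log(pred[label])
        total += cross_entropy + kl_weight * kl_divergence(soft_target, pred)
    return total / len(quantizers)


# Hypothetical usage: 6 codebooks (the count reported in the abstract),
# with toy dimensions and random logits standing in for encoder outputs.
rng = random.Random(0)
quantizers = [make_quantizer(8, 4, 16, rng) for _ in range(6)]
frame = [rng.gauss(0, 1) for _ in range(8)]
logits = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(6)]
loss = combined_loss(frame, logits, quantizers)
```

In this sketch each codebook contributes its own hard label for the cross-entropy term, while the KL term pulls the predicted distribution toward a softer target; setting `kl_weight=0` reduces it to a plain multi-codebook cross-entropy objective.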

Paper Structure

This paper contains 21 sections, 3 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: BEST-RQ training setup: (left) shows the baseline setup of BEST-RQ proposed in chiu2022bestrq, (right) shows our proposed modifications, including KL-divergence as regularizing loss and multiple codebooks in pre-training.
  • Figure 2: Validation loss in pre-training for the baseline method and our proposed modified configuration.