Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations

Kunal Dhawan; Nithin Rao Koluguri; Ante Jukić; Ryan Langman; Jagadeesh Balam; Boris Ginsburg

TL;DR

This paper introduces a codec ASR pipeline that outperforms Encodec at a similar bit-rate. It also surpasses the state-of-the-art results of strong self-supervised models on the 143-language ML-SUPERB benchmark, despite being smaller and pretrained on significantly less data.

Abstract

Discrete speech representations have garnered recent attention for their efficacy in training transformer-based models for various speech-related tasks such as automatic speech recognition (ASR), translation, speaker verification, and joint speech-text foundational models. In this work, we present a comprehensive analysis of building ASR systems with discrete codes. We investigate different methods for codec training, such as quantization schemes and time-domain vs. spectral feature encodings. We further explore ASR training techniques aimed at enhancing performance, training efficiency, and noise robustness. Drawing upon our findings, we introduce a codec ASR pipeline that outperforms Encodec at a similar bit-rate. Remarkably, it also surpasses the state-of-the-art results achieved by strong self-supervised models on the 143-language ML-SUPERB benchmark, despite being smaller in size and pretrained on significantly less data.

Paper Structure

This paper contains 24 sections, 2 figures, and 4 tables.

Introduction
Speech recognition with audio codecs
Audio codecs
Quantization schemes
Time-domain NAC
Spectral NAC
Speech recognition pipeline
Embedding layer and codebook initialization
Code aggregation strategies
Spectrogram augmentation
Noisy embedding training
Experimental setup
NAC model training
ASR model training
Experiments and ablations
...and 9 more sections

Figures (2)

Figure 1: Architecture of the considered neural audio codecs.
Figure 2: The ASR with discrete codes pipeline.
