Multi-blank Transducers for Speech Recognition

Hainan Xu, Fei Jia, Somshubra Majumdar, Shinji Watanabe, Boris Ginsburg

TL;DR

This work tackles slow inference and training challenges in RNN-T by introducing big blank symbols, each of which consumes multiple input frames when emitted. The available blank durations are given by a set $\mathcal{N}$ with $1\in\mathcal{N}$, so the standard one-frame blank is always retained. The authors derive a modified forward-backward algorithm and an inference procedure in which a big blank of duration $m$ advances the time index by $m$ frames, significantly speeding up decoding. To bias the model toward emitting big blanks, they apply logit under-normalization with $\sigma=0.05$, which yields path weights that penalize longer emission sequences and favor shorter paths built from long-duration blanks. Empirical results on Librispeech and German MLS show substantial inference speedups (up to $+139.6\%$) and consistent WER improvements across languages, and the authors release their implementation in NVIDIA's NeMo toolkit.
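The inference procedure summarized above can be illustrated with a minimal greedy-decoding sketch. It is not the NeMo implementation: the function names, the `score_fn` callback, the output layout (real tokens first, then one index per blank duration), and the `max_symbols_per_frame` cap are all assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_multi_blank_decode(
    score_fn: Callable[[int, Sequence[int]], Sequence[float]],
    num_frames: int,
    vocab_size: int,
    blank_durations: Sequence[int],
    max_symbols_per_frame: int = 10,
) -> Tuple[List[int], int]:
    """Greedy decoding sketch; returns (token ids, number of score_fn calls).

    Assumed output layout (hypothetical): indices [0, vocab_size) are real
    tokens, and index vocab_size + k is the blank of blank_durations[k] frames.
    """
    tokens: List[int] = []
    t = 0                # current encoder frame
    calls = 0            # how many times the joint scores were evaluated
    emitted_here = 0     # tokens emitted without advancing t
    while t < num_frames:
        scores = score_fn(t, tokens)
        calls += 1
        best = max(range(len(scores)), key=scores.__getitem__)
        if best < vocab_size and emitted_here < max_symbols_per_frame:
            # Real token: extend the hypothesis, stay on the same frame.
            tokens.append(best)
            emitted_here += 1
        else:
            # Blank of duration m: skip m frames at once. If the per-frame
            # cap was hit instead, fall back to advancing one frame.
            duration = blank_durations[best - vocab_size] if best >= vocab_size else 1
            t += duration
            emitted_here = 0
    return tokens, calls


def toy_scores(t: int, tokens: Sequence[int]) -> List[float]:
    """Toy stand-in for a joint network: tokens at frames 0 and 4, else the 4-frame blank."""
    scores = [0.0] * 5          # 2 tokens + blanks of duration 1, 2, 4
    if t == 0 and len(tokens) == 0:
        scores[0] = 1.0
    elif t == 4 and len(tokens) == 1:
        scores[1] = 1.0
    else:
        scores[4] = 1.0         # index 4 is the blank with duration 4
    return scores


decoded, num_calls = greedy_multi_blank_decode(
    toy_scores, num_frames=8, vocab_size=2, blank_durations=[1, 2, 4]
)
```

In this toy run the 8 frames are covered with two big-blank emissions, so the scores are evaluated 4 times in total. A decoder restricted to one-frame blanks would need 8 blank steps plus 2 token steps for the same utterance, which is the source of the decoding speedup.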

Abstract

This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR). In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted. We refer to the added symbols as big blanks, and the method multi-blank RNN-T. For training multi-blank RNN-Ts, we propose a novel logit under-normalization method in order to prioritize emissions of big blanks. With experiments on multiple languages and datasets, we show that multi-blank RNN-T methods could bring relative speedups of over +90%/+139% to model inference for English Librispeech and German Multilingual Librispeech datasets, respectively. The multi-blank RNN-T method also improves ASR accuracy consistently. We will release our implementation of the method in the NeMo (https://github.com/NVIDIA/NeMo) toolkit.
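The effect of logit under-normalization on path scores can be sketched as follows. This assumes that under-normalization amounts to subtracting a constant $\sigma$ from every log-softmax output during training; that reading, along with the function names, is an assumption of this example rather than something stated in the text above.

```python
import math
from typing import List, Sequence


def under_normalized_log_probs(logits: Sequence[float], sigma: float = 0.05) -> List[float]:
    """Log-softmax of the logits with an extra -sigma on every entry (assumed form)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))  # stable log-sum-exp
    return [x - log_z - sigma for x in logits]


def path_penalty(num_emissions: int, sigma: float = 0.05) -> float:
    """Total extra log-weight a path receives: -sigma per emitted symbol."""
    return -sigma * num_emissions


log_probs = under_normalized_log_probs([2.0, 0.5, -1.0])
total_mass = sum(math.exp(lp) for lp in log_probs)   # exp(-sigma), i.e. less than 1

# Covering 4 frames with four one-frame blanks vs. a single 4-frame blank.
penalty_small_blanks = path_penalty(4)
penalty_big_blank = path_penalty(1)
```

Under this assumption each emission costs $\sigma$ in log-space, so a path that covers four frames with one big blank is penalized by $\sigma$ while a path using four standard blanks is penalized by $4\sigma$, which is why shorter emission sequences are favored.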

Paper Structure

This paper contains 15 sections, 6 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Output probability lattices of a standard RNN-T model, as in Graves (2012), and of a multi-blank RNN-T.
  • Figure 2: Emission distribution on Librispeech test-other. UN means the model is trained with logits under-normalization.