Multi-blank Transducers for Speech Recognition

Hainan Xu; Fei Jia; Somshubra Majumdar; Shinji Watanabe; Boris Ginsburg

TL;DR

This work tackles slow inference and the associated training challenges in RNN-T by introducing big blank symbols, each of which consumes multiple input frames when emitted. The allowed blank durations form a set $\mathcal{N}$ with $1\in\mathcal{N}$, so the standard one-frame blank is always available. The authors derive a modified forward-backward algorithm and an inference procedure in which emitting a big blank of duration $m$ advances the time index by $m$ frames, which speeds up decoding. To bias the model toward emitting big blanks, they apply logits under-normalization with $\sigma=0.05$, which defines path weights that penalize longer emission sequences and favor shorter paths that rely on large durations. Empirical results on Librispeech and German MLS show relative inference speedups of up to $+139.6\%$ and consistent WER improvements across both languages, and the authors release their implementation in NVIDIA's NeMo toolkit.
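The inference rule described above can be illustrated with a minimal greedy-decoding sketch. It is not the authors' NeMo implementation: the function `greedy_multi_blank_decode`, the scripted `toy_step` stand-in for the joint network, the token ids, and the `max_symbols_per_frame` guard are all hypothetical and chosen only for illustration. The sketch shows that a blank of duration $m$ moves the time index forward by $m$ frames, so fewer decoding steps are needed than with one-frame blanks alone.

```python
from typing import Callable, Dict, List, Tuple


def greedy_multi_blank_decode(
    num_frames: int,
    step_fn: Callable[[int, List[int]], int],
    blank_durations: Dict[int, int],
    max_symbols_per_frame: int = 10,
) -> Tuple[List[int], int]:
    """Greedy decoding where each blank id maps to a frame duration.

    step_fn(t, hyp) is a hypothetical stand-in for encoder + predictor + joint:
    it returns the arg-max token id at frame t given the labels emitted so far.
    Returns the hypothesis and the number of decoding steps taken.
    """
    t = 0
    hyp: List[int] = []
    steps = 0
    emitted_at_t = 0
    while t < num_frames:
        token = step_fn(t, hyp)
        steps += 1
        if token in blank_durations:
            # Standard blank has duration 1; a big blank skips m >= 2 frames.
            t += blank_durations[token]
            emitted_at_t = 0
        else:
            hyp.append(token)
            emitted_at_t += 1
            if emitted_at_t >= max_symbols_per_frame:
                # Safety guard (an assumption): force progress in time.
                t += 1
                emitted_at_t = 0
    return hyp, steps


# Toy setup (all ids are assumptions): labels 0..2, blank 3 (1 frame),
# big blanks 4 (2 frames) and 5 (4 frames), i.e. durations {1, 2, 4}.
BLANKS = {3: 1, 4: 2, 5: 4}


def toy_step(t: int, hyp: List[int]) -> int:
    # Scripted outputs keyed by (frame index, labels emitted so far).
    script = {(0, 0): 1, (0, 1): 5, (4, 1): 2, (4, 2): 4}
    return script.get((t, len(hyp)), 3)


hyp, steps = greedy_multi_blank_decode(8, toy_step, BLANKS)
print(hyp, steps)
```

In this toy run the eight frames are covered in six decoding steps; with only one-frame blanks the same two labels would need ten steps (eight blanks plus two labels).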

Abstract

This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR). In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted. We refer to the added symbols as big blanks, and the method multi-blank RNN-T. For training multi-blank RNN-Ts, we propose a novel logit under-normalization method in order to prioritize emissions of big blanks. With experiments on multiple languages and datasets, we show that multi-blank RNN-T methods could bring relative speedups of over +90%/+139% to model inference for English Librispeech and German Multilingual Librispeech datasets, respectively. The multi-blank RNN-T method also improves ASR accuracy consistently. We will release our implementation of the method in the NeMo (https://github.com/NVIDIA/NeMo) toolkit.
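The effect of logit under-normalization on training can be sketched as follows. This is a minimal illustration under an assumption: that under-normalization amounts to subtracting a constant $\sigma$ from every log-softmax output, so that each emission on a path costs an extra $\sigma$ in log-weight. The function names below are hypothetical and are not taken from the NeMo toolkit.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    # Numerically stable log-softmax over one joint-network output vector.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def under_normalized_log_probs(logits: List[float], sigma: float = 0.05) -> List[float]:
    # Assumed form of under-normalization: shift every log-probability by -sigma,
    # so the outputs sum to exp(-sigma) < 1 instead of 1.
    return [lp - sigma for lp in log_softmax(logits)]


def path_penalty(num_emissions: int, sigma: float = 0.05) -> float:
    # Each emission (label or blank) on a path contributes -sigma to its log-weight.
    return -sigma * num_emissions


logits = [0.3, -1.2, 2.0, 0.5]
total_mass = sum(math.exp(lp) for lp in under_normalized_log_probs(logits))

# A path with 10 emissions versus one that covers the same frames in 6 emissions
# by using big blanks: the longer path is penalized by an extra 4 * sigma.
extra_penalty = path_penalty(10) - path_penalty(6)
print(total_mass, extra_penalty)
```

Because the penalty grows with the number of emissions, paths that cover the same frames with fewer, longer-duration blanks receive relatively higher weight, which matches the stated goal of prioritizing big blanks.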

Paper Structure (15 sections, 6 equations, 2 figures, 5 tables)

Introduction
Multi-blank RNN-T
Blank symbol in RNN-T
Multi-blank RNN-T
Forward-backward algorithm
Model inference
Logits Under-normalization
Experiments
Librispeech results
German ASR results
Analysis
Impact of under-normalization for training
Big blank emission frequency
Efficient batched inference for multi-blank transducers
Conclusion and Future Work

Figures (2)

Figure 1: Output probability lattices of a standard RNN-T model, as in Graves (2012), and of a multi-blank RNN-T.
Figure 2: Emission distribution on Librispeech test-other. UN means the model is trained with logits under-normalization.
