Label-Looping: Highly Efficient Decoding for Transducers

Vladimir Bataev; Hainan Xu; Daniel Galvez; Vitaly Lavrukhin; Boris Ginsburg

TL;DR

The paper addresses the inefficiency of greedy Transducer decoding by introducing label-looping, which processes blank and non-blank emissions separately to maximize parallelism and stores batched partial hypotheses in BatchedHyps, a data structure built on CUDA tensors. The approach is demonstrated on both RNN-T and TDT models, achieving up to 2.0X speedup at batch size 32 (and up to 3.8X for the non-encoder part of decoding in some TDT cases). Precomputing the encoder and predictor projections yields a further decoding speedup. The method is compatible with TorchScript and CUDA Graphs, which can accelerate it further, and the implementation is open-sourced in the NeMo toolkit. Faster Transducer inference makes larger predictors and real-time deployment on modern GPUs more practical.

Abstract

This paper introduces a highly efficient greedy decoding algorithm for Transducer-based speech recognition models. We restructure the standard nested-loop design of RNN-T decoding by swapping the loops over frames and labels: the outer loop iterates over labels, while the inner loop iterates over frames, searching for the next non-blank symbol. Additionally, we represent partial hypotheses in a special structure built on CUDA tensors, which supports parallelized manipulation of hypotheses. Experiments show that the label-looping algorithm is up to 2.0X faster than conventional batched decoding at batch size 32. It can be combined with other compiler or GPU call-related techniques to achieve further speedup. Our algorithm is general-purpose and works with both conventional Transducers and Token-and-Duration Transducers. We open-source our implementation to benefit the research community.
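The loop swap described in the abstract can be illustrated with a minimal single-utterance sketch. This is not the authors' implementation: `toy_joint`, `TOY_TABLE`, and both function names are hypothetical, and the joint network is replaced by a lookup table that replays the alignment C, blank, blank, A, T, blank, blank over 4 encoder frames. The `max_symbols` cap on labels per frame is an assumed safeguard, common in greedy Transducer decoders.

```python
BLANK = "<b>"

# Toy stand-in for the joint network (an assumption for illustration):
# (frame index, number of labels emitted so far) -> predicted symbol.
# Any pair not listed predicts blank.
TOY_TABLE = {(0, 0): "C", (2, 1): "A", (2, 2): "T"}


def toy_joint(t, num_labels):
    return TOY_TABLE.get((t, num_labels), BLANK)


def frame_looping_greedy(joint, num_frames, max_symbols=10):
    """Conventional order: outer loop over frames, inner loop over labels."""
    hyp = []
    for t in range(num_frames):
        for _ in range(max_symbols):
            sym = joint(t, len(hyp))
            if sym == BLANK:
                break  # blank: move on to the next frame
            hyp.append(sym)  # non-blank: stay on this frame
    return hyp


def label_looping_greedy(joint, num_frames, max_symbols=10):
    """Swapped order: outer loop over labels, inner loop over frames."""
    hyp = []
    t = 0
    emitted_at_t = 0
    while t < num_frames:
        sym = joint(t, len(hyp))
        # Inner loop: the label history is fixed here, so only the frame
        # index changes while searching for the next non-blank symbol.
        while sym == BLANK or emitted_at_t >= max_symbols:
            t += 1
            emitted_at_t = 0
            if t >= num_frames:
                return hyp
            sym = joint(t, len(hyp))
        # Exactly one label is appended per outer iteration.
        hyp.append(sym)
        emitted_at_t += 1
    return hyp


print(frame_looping_greedy(toy_joint, 4))  # ['C', 'A', 'T']
print(label_looping_greedy(toy_joint, 4))  # ['C', 'A', 'T']
```

Both orders visit the same (frame, label-history) pairs and return the same hypothesis. The difference is where the label history changes: once per outer iteration in the label-looping order, which is what makes the inner blank search easy to run in parallel across a batch.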

Paper Structure (13 sections, 2 figures, 5 tables, 3 algorithms)

Introduction
Background
Label-looping Decoding Algorithm
Representation of batched hypotheses
Label-looping algorithm for RNN-Transducers
Label-looping for Token-and-Duration Transducers
Precomputation of encoder/predictor projections
Experiments
Results with RNN Transducers
Results with Token-and-Duration Transducers
Analysis
Combining Label-Looping with TorchScript and CUDA Graphs
Conclusion

Figures (2)

Figure 1: Transducer Architecture
Figure 2: Operations of the frame-looping and label-looping decoding algorithms. The batch size is 2, and the encoder output length is 4. Ground-truth transcriptions are "CAT" and "DOG". Alignments: 'C $\langle b \rangle$ $\langle b \rangle$ A T $\langle b \rangle$ $\langle b \rangle$' and '$\langle b \rangle$ D $\langle b \rangle$ $\langle b \rangle$ O G $\langle b \rangle$', where $\langle b \rangle$ denotes the blank symbol. The $\emptyset$ symbol marks unnecessary computations that arise in the algorithm because of batched decoding.
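The batched setting in Figure 2 can be replayed with a small sketch. It assumes plain Python lists in place of CUDA tensors, and every name in it (`BatchedHypsSketch`, `make_joint`, `batched_label_looping`) is hypothetical rather than the NeMo API. The joint network is again replaced by a table built from the two alignments in the caption, and the usual cap on labels per frame is omitted for brevity.

```python
BLANK = "<b>"

# The two alignments from the Figure 2 caption, one per batch element.
ALIGNMENTS = [
    ["C", BLANK, BLANK, "A", "T", BLANK, BLANK],
    [BLANK, "D", BLANK, BLANK, "O", "G", BLANK],
]


def make_joint(alignment):
    """Toy joint: replays an alignment as (frame, labels so far) -> symbol."""
    table, t, n = {}, 0, 0
    for sym in alignment:
        if sym == BLANK:
            t += 1  # blank advances the frame
        else:
            table[(t, n)] = sym  # non-blank extends the hypothesis
            n += 1
    return lambda frame, num_labels: table.get((frame, num_labels), BLANK)


class BatchedHypsSketch:
    """Fixed-size label buffer plus per-element lengths (list-based stand-in)."""

    def __init__(self, batch_size, max_len):
        self.labels = [[None] * max_len for _ in range(batch_size)]
        self.lengths = [0] * batch_size

    def add_masked(self, mask, new_labels):
        # Append one label to every element whose mask entry is True.
        for b, (on, label) in enumerate(zip(mask, new_labels)):
            if on:
                self.labels[b][self.lengths[b]] = label
                self.lengths[b] += 1

    def text(self, b):
        return "".join(self.labels[b][: self.lengths[b]])


def batched_label_looping(joints, num_frames, max_len=16):
    batch = len(joints)
    hyps = BatchedHypsSketch(batch, max_len)
    t = [0] * batch
    active = [num_frames > 0] * batch
    while any(active):
        sym = [joints[b](t[b], hyps.lengths[b]) if active[b] else BLANK
               for b in range(batch)]
        # Inner loop over frames: only elements still on blank advance.
        searching = [active[b] and sym[b] == BLANK for b in range(batch)]
        while any(searching):
            for b in range(batch):
                if not searching[b]:
                    continue
                t[b] += 1
                if t[b] >= num_frames:
                    active[b] = searching[b] = False  # utterance finished
                else:
                    sym[b] = joints[b](t[b], hyps.lengths[b])
                    searching[b] = sym[b] == BLANK
        # One masked append per outer iteration, for all active elements.
        hyps.add_masked(active, sym)
    return hyps


hyps = batched_label_looping([make_joint(a) for a in ALIGNMENTS], num_frames=4)
print(hyps.text(0), hyps.text(1))  # CAT DOG
```

In this toy run every outer iteration appends a label to all still-active elements at once, so label updates happen in lockstep even though the two utterances sit on different frames.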
