Label-Looping: Highly Efficient Decoding for Transducers
Vladimir Bataev, Hainan Xu, Daniel Galvez, Vitaly Lavrukhin, Boris Ginsburg
TL;DR
The paper addresses the inefficiency of greedy Transducer decoding by introducing label-looping, which processes blank and non-blank emissions separately to maximize parallelism and stores batched partial hypotheses in BatchedHyps, a data structure built on CUDA tensors. The approach is demonstrated on both RNN-T and TDT models, achieving speedups of up to 2.0X at batch size 32 (and up to 3.8X on the non-encoder part of decoding in some TDT cases), with further gains possible when combined with compiler and GPU-optimization techniques. Precomputing the encoder and predictor projections yields additional decoding-speed improvements, and the implementation is open-sourced in the NeMo toolkit. The results indicate strong practical impact for faster Transducer inference, enabling larger predictors and real-time deployment on modern GPUs, while remaining compatible with TorchScript and CUDA Graphs for further acceleration.
Abstract
This paper introduces a highly efficient greedy decoding algorithm for Transducer-based speech recognition models. We restructure the standard nested-loop design for RNN-T decoding by swapping the loops over frames and labels: the outer loop iterates over labels, while the inner loop iterates over frames, searching for the next non-blank symbol. Additionally, we represent partial hypotheses in a special structure using CUDA tensors, which supports parallelized hypothesis manipulation. Experiments show that the label-looping algorithm is up to 2.0X faster than conventional batched decoding at batch size 32. It can be combined with other compiler or GPU call-related techniques to achieve further speedup. Our algorithm is general-purpose and works with both conventional Transducers and Token-and-Duration Transducers. We open-source our implementation to benefit the research community.
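The loop swap described in the abstract can be illustrated with a minimal, unbatched sketch in pure Python. It is not the NeMo implementation: `toy_joint_argmax` is a hypothetical stand-in for the encoder, predictor, and joint networks, `max_symbols` is an assumed per-frame emission cap, and the batching and CUDA-tensor hypothesis storage that provide the actual speedup are omitted. The sketch only shows that both loop orders produce the same greedy hypothesis.

```python
BLANK = 0  # assumed blank label id


def toy_joint_argmax(t, last):
    """Hypothetical joint: greedy label for frame t given the last emitted label."""
    script = {(1, 0): 3, (1, 3): 4, (4, 4): 2}  # made-up emissions for illustration
    return script.get((t, last), BLANK)


def frame_looping_greedy(joint_argmax, num_frames, max_symbols=5):
    """Conventional order: outer loop over frames, inner loop over labels."""
    hyp, last = [], BLANK
    for t in range(num_frames):
        for _ in range(max_symbols):
            label = joint_argmax(t, last)
            if label == BLANK:
                break  # blank: move on to the next frame
            hyp.append(label)
            last = label
    return hyp


def label_looping_greedy(joint_argmax, num_frames, max_symbols=5):
    """Swapped order: outer loop over labels, inner loop over frames."""
    hyp, last = [], BLANK
    t, emitted_here = 0, 0
    while t < num_frames:  # one outer iteration per emitted label
        label = BLANK
        # Inner loop: the prediction state `last` stays fixed while frames
        # are scanned for the next non-blank symbol.
        while t < num_frames:
            label = joint_argmax(t, last)
            if label != BLANK:
                break
            t += 1
            emitted_here = 0
        if label == BLANK:
            break  # ran out of frames without finding another label
        hyp.append(label)
        last = label
        emitted_here += 1
        if emitted_here == max_symbols:  # cap reached: force a frame advance
            t += 1
            emitted_here = 0
    return hyp


if __name__ == "__main__":
    print(frame_looping_greedy(toy_joint_argmax, 6))
    print(label_looping_greedy(toy_joint_argmax, 6))
```

In the swapped order, the prediction state changes only once per outer iteration, so everything that depends on the last label is fixed during the frame scan. In a batched setting this is what lets the non-blank update be applied once per step across all hypotheses, although this sketch does not model that.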
