WIND: Accelerated RNN-T Decoding with Windowed Inference for Non-blank Detection
Hainan Xu, Vladimir Bataev, Lilit Grigoryan, Boris Ginsburg
TL;DR
The paper tackles the bottleneck of frame-by-frame RNN-T decoding by introducing WIND, a Windowed Inference strategy that processes a window of frames in parallel to rapidly locate the next non-blank token without sacrificing accuracy. It presents three WIND variants: greedy WIND, batched WIND with label-looping, and a novel WIND beam-search, each designed to reduce decoding latency while preserving or improving accuracy. Empirical results across multiple models and datasets show up to 2.4X speedups in greedy mode and strong speed-accuracy advantages for WIND beam-search compared to existing methods such as ALSD and MAES, with negligible memory overhead and compatibility with CUDA-graph optimizations. The work offers a practical, open-source approach to deploying faster RNN-T inference in diverse ASR tasks, enabling more efficient real-time or low-latency applications.
Abstract
We propose Windowed Inference for Non-blank Detection (WIND), a novel strategy that significantly accelerates RNN-T inference without compromising model accuracy. Instead of processing frames sequentially during inference, WIND processes multiple frames within a window in parallel, allowing the model to quickly locate non-blank predictions during decoding, which yields significant speed-ups. We implement WIND for greedy decoding and for batched greedy decoding with label-looping techniques, and we also propose a novel beam-search decoding method. Experiments on multiple datasets under different conditions show that, in greedy modes, our method achieves speedups of up to 2.4X over the baseline sequential approach while maintaining identical Word Error Rate (WER). Our beam-search algorithm achieves slightly better accuracy than alternative methods, with significantly improved speed. We will open-source our WIND implementation.
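
The windowed idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy joint network, the decoder update, the blank bias, the `max_symbols` cap, and all function names below are hypothetical stand-ins chosen only to show the control flow. The sequential loop scores one frame per joint call, while the windowed loop scores up to `window` frames against the same decoder state in one call and jumps to the first non-blank prediction. Because the decoder state only changes when a non-blank token is emitted, both loops should produce the same hypothesis.

```python
import numpy as np

BLANK = 0  # assumed index of the blank symbol


def make_toy_model(vocab=5, dim=8, seed=0):
    """Hypothetical toy parameters standing in for a trained RNN-T."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(dim, vocab))    # joint projection
    emb = rng.normal(size=(vocab, dim))  # token embeddings for the decoder
    return W, emb


def joint_argmax(enc_frames, dec_state, W, blank_bias=3.0):
    """Score n encoder frames against one decoder state; return n token ids."""
    logits = np.tanh(enc_frames + dec_state) @ W  # shape (n, vocab)
    logits[:, BLANK] += blank_bias                # toy bias so blanks dominate
    return logits.argmax(axis=-1)


def update_decoder(dec_state, token, emb):
    """Toy stand-in for the prediction network update."""
    return np.tanh(0.5 * dec_state + emb[token])


def greedy_sequential(enc, W, emb, max_symbols=3):
    t, emitted_here, calls = 0, 0, 0
    dec, out = np.zeros(enc.shape[1]), []
    while t < len(enc):
        if emitted_here >= max_symbols:  # cap emissions per frame
            t, emitted_here = t + 1, 0
            continue
        k = int(joint_argmax(enc[t:t + 1], dec, W)[0])
        calls += 1
        if k == BLANK:
            t, emitted_here = t + 1, 0
        else:
            out.append((t, k))
            dec = update_decoder(dec, k, emb)
            emitted_here += 1
    return out, calls


def greedy_windowed(enc, W, emb, window=4, max_symbols=3):
    t, emitted_here, calls = 0, 0, 0
    dec, out = np.zeros(enc.shape[1]), []
    while t < len(enc):
        if emitted_here >= max_symbols:
            t, emitted_here = t + 1, 0
            continue
        # One joint call covers up to `window` frames with the same decoder state.
        preds = joint_argmax(enc[t:t + window], dec, W)
        calls += 1
        nonblank = np.flatnonzero(preds != BLANK)
        if nonblank.size == 0:           # whole window is blank: skip it
            t, emitted_here = t + len(preds), 0
            continue
        offset = int(nonblank[0])        # first non-blank frame in the window
        if offset > 0:
            emitted_here = 0
        t += offset
        k = int(preds[offset])
        out.append((t, k))
        dec = update_decoder(dec, k, emb)
        emitted_here += 1
    return out, calls


W, emb = make_toy_model()
enc = np.random.default_rng(1).normal(size=(50, 8))
seq_out, seq_calls = greedy_sequential(enc, W, emb)
win_out, win_calls = greedy_windowed(enc, W, emb, window=8)
print(seq_out == win_out, seq_calls, win_calls)
```

In this sketch the windowed loop never makes more joint calls than the sequential one, but each call does more work, so any real gain depends on hardware that evaluates the window in parallel.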
