HAINAN: Fast and Accurate Transducer for Hybrid-Autoregressive ASR
Hainan Xu, Travis M. Bartley, Vladimir Bataev, Boris Ginsburg
TL;DR
HAINAN addresses the speed/accuracy trade-off in end-to-end ASR by unifying autoregressive (AR) and non-autoregressive (NAR) inference in a single Token-and-Duration Transducer (TDT), trained with randomly masked predictor outputs. The same model supports AR, NAR, and a novel semi-autoregressive (SAR) mode, plus Viterbi decoding. AR accuracy matches or exceeds RNN-T and TDT, NAR accuracy surpasses CTC, and decoding speed is on par with CTC in NAR mode and with TDT in AR mode. Key contributions include a simple one-line training change, SAR refinement, and a DAG-based Viterbi decoder, validated on English and German datasets with large-scale encoder backbones. The result is a flexible way to balance accuracy and latency in real-world deployments, and the authors plan to release open-source implementations and checkpoints.
Abstract
We present Hybrid-Autoregressive INference TrANsducers (HAINAN), a novel architecture for speech recognition that extends the Token-and-Duration Transducer (TDT) model. Trained with randomly masked predictor network outputs, HAINAN supports both autoregressive inference with all network components and non-autoregressive inference without the predictor. Additionally, we propose a novel semi-autoregressive inference paradigm that first generates an initial hypothesis using non-autoregressive inference, followed by refinement steps where each token prediction is regenerated using parallelized autoregression on the initial hypothesis. Experiments on multiple datasets across different languages demonstrate that HAINAN achieves efficiency parity with CTC in non-autoregressive mode and with TDT in autoregressive mode. In terms of accuracy, autoregressive HAINAN outperforms TDT and RNN-T, while non-autoregressive HAINAN significantly outperforms CTC. Semi-autoregressive inference further enhances the model's accuracy with minimal computational overhead, and even outperforms TDT results in some cases. These results highlight HAINAN's flexibility in balancing accuracy and speed, positioning it as a strong candidate for real-world speech recognition applications.
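The two mechanisms described in the abstract, predictor masking and semi-autoregressive refinement, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes the joint network combines encoder and predictor vectors by addition and that masking zeroes the predictor vector. The names `joint_input`, `training_joint_input`, `sar_refine`, `mask_prob`, and `bigram_bonus` are hypothetical, and the per-frame score table with a bigram bonus is a toy stand-in for a real joint network conditioned on token history.

```python
import random


def joint_input(enc_vec, pred_vec, mask_predictor):
    """Combine encoder and predictor vectors (assumed: element-wise addition).

    With mask_predictor=True the predictor contribution is zeroed, which
    mimics non-autoregressive inference that uses the encoder alone.
    """
    if mask_predictor:
        pred_vec = [0.0] * len(pred_vec)
    return [e + p for e, p in zip(enc_vec, pred_vec)]


def training_joint_input(enc_vec, pred_vec, mask_prob, rng):
    """Training-time variant: mask the predictor at random.

    mask_prob is an assumed hyperparameter, not a value from the paper.
    """
    return joint_input(enc_vec, pred_vec, rng.random() < mask_prob)


def sar_refine(frame_scores, initial_tokens, bigram_bonus, num_passes=1):
    """Toy semi-autoregressive refinement.

    frame_scores[i] maps candidate tokens to encoder-only scores at the
    frame where token i was emitted by the initial NAR pass.
    bigram_bonus[(prev, tok)] stands in for the predictor's contribution.
    """
    tokens = list(initial_tokens)
    for _ in range(num_passes):
        # Each position conditions on the PREVIOUS pass's hypothesis, not on
        # tokens produced in this pass, so all positions are independent and
        # could be computed in parallel.
        histories = [None] + tokens[:-1]
        tokens = [
            max(scores, key=lambda tok: scores[tok] + bigram_bonus.get((hist, tok), 0.0))
            for scores, hist in zip(frame_scores, histories)
        ]
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    print(training_joint_input([0.5, -1.0], [0.25, 2.0], mask_prob=0.5, rng=rng))

    frame_scores = [{"a": 1.0, "b": 0.2}, {"b": 0.6, "d": 0.5}]
    print(sar_refine(frame_scores, ["a", "b"], {("a", "d"): 0.2}))
```

In the toy example the encoder-only pass prefers "b" at the second position, while conditioning on the preceding "a" tips the choice to "d". This mirrors how a refinement step can revise an initial non-autoregressive hypothesis without sequential decoding.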
