Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition

Vahid Noroozi, Somshubra Majumdar, Ankur Kumar, Jagadeesh Balam, Boris Ginsburg

TL;DR

This work tackles the accuracy-latency trade-off in streaming ASR by adapting FastConformer with an activation caching mechanism that lets the non-autoregressive encoder run autoregressively at inference time, while constraining the left and right contexts so that inference matches training. It introduces a hybrid CTC/RNNT architecture in which both decoders share a single encoder, which speeds up convergence and improves accuracy, and it presents a caching scheme that avoids the recomputation required by buffered streaming. Experiments on LibriSpeech and NeMo ASRSET show better accuracy and lower latency than buffered streaming, with chunk-aware look-ahead outperforming regular look-ahead at the same latency. The approach is open-sourced and supports multiple latencies in a single model via multi-lookahead training, making it practical for real-time, multi-domain streaming ASR deployments.

Abstract

In this paper, we propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture. We adapt the FastConformer architecture for streaming applications by (1) constraining both the look-ahead and past contexts in the encoder, and (2) introducing an activation caching mechanism that enables the non-autoregressive encoder to operate autoregressively during inference. The proposed model is designed to eliminate the accuracy disparity between training and inference that is common in many streaming models. Furthermore, the proposed encoder works with various decoder configurations, including Connectionist Temporal Classification (CTC) and RNN-Transducer (RNNT) decoders. We also introduce a hybrid CTC/RNNT architecture that uses a shared encoder with both a CTC and an RNNT decoder to boost accuracy and save computation. We evaluate the proposed model on the LibriSpeech dataset and on a multi-domain large-scale dataset, and demonstrate that it achieves better accuracy with lower latency and inference time than a conventional buffered streaming baseline. We also show that training a model with multiple latencies can achieve better accuracy than single-latency models while allowing a single model to support multiple latencies. Our experiments further show that the hybrid architecture not only speeds up the convergence of the CTC decoder but also improves the accuracy of streaming models compared with single-decoder models.
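
The caching idea in point (2) can be illustrated on a single layer. The following is a minimal sketch, not the paper's implementation: it assumes a causal 1-D convolution (left context only) and uses hypothetical helper names `causal_conv` and `streaming_causal_conv`. It shows that keeping the last `k - 1` input frames as a cache lets chunk-by-chunk processing reproduce the full-sequence output exactly, without recomputing earlier frames.

```python
import numpy as np


def causal_conv(x, kernel):
    """Causal 1-D convolution over a full sequence (zero left-padding)."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])


def streaming_causal_conv(chunks, kernel):
    """Same convolution, applied chunk by chunk with an activation cache.

    The cache holds the last k - 1 input frames seen so far, which is all
    the left context a kernel of size k needs (illustrative assumption).
    """
    k = len(kernel)
    cache = np.zeros(k - 1)  # matches the zero left-padding of the offline pass
    outputs = []
    for chunk in chunks:
        buf = np.concatenate([cache, chunk])
        outputs.append(
            np.array([buf[t:t + k] @ kernel for t in range(len(chunk))])
        )
        # Keep only the frames the next chunk will need as left context.
        cache = buf[len(buf) - (k - 1):]
    return np.concatenate(outputs)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = rng.standard_normal(16)
    weights = rng.standard_normal(3)
    pieces = [signal[:4], signal[4:10], signal[10:]]
    print(np.allclose(causal_conv(signal, weights),
                      streaming_causal_conv(pieces, weights)))
```

Because the streamed and offline outputs coincide, a layer of this kind behaves identically in training (full sequence) and inference (chunked), which is the property the abstract attributes to the proposed design.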

Paper Structure (12 sections, 1 equation, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Diagram of how the context is extended across stacked self-attention layers in regular look-ahead vs. chunk-aware look-ahead. With regular look-ahead, the dependency on future frames grows with network depth, whereas it stays the same with the chunk-aware approach.
  • Figure 2: Architecture of the hybrid CTC/RNNT model.
  • Figure 3: Caching schema of self-attention and convolution layers for consecutive chunks.
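
The contrast described in the Figure 1 caption can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' code: the function names, the mask definitions, and the parameters (`left`, `right`, `chunk`, `left_chunks`) are hypothetical. It builds a boolean attention mask for each scheme and composes it over several layers to measure how far into the future any frame can ultimately depend.

```python
import numpy as np


def regular_lookahead_mask(num_frames, left, right):
    """Frame i may attend to frames i - left .. i + right (assumed form)."""
    i = np.arange(num_frames)[:, None]
    j = np.arange(num_frames)[None, :]
    return (j >= i - left) & (j <= i + right)


def chunk_aware_mask(num_frames, chunk, left_chunks):
    """Frame i may attend to its own chunk and `left_chunks` previous chunks."""
    chunk_id = np.arange(num_frames) // chunk
    ci = chunk_id[:, None]
    cj = chunk_id[None, :]
    return (cj <= ci) & (cj >= ci - left_chunks)


def max_future_dependency(mask, layers):
    """Largest (j - i) such that frame i depends on frame j after `layers` layers."""
    num_frames = mask.shape[0]
    step = mask.astype(np.int64)
    reach = np.eye(num_frames, dtype=np.int64)
    for _ in range(layers):
        # One more layer: frame i reaches whatever its attended frames reach.
        reach = ((step @ reach) > 0).astype(np.int64)
    # Index of the last reachable frame in each row.
    last_visible = num_frames - 1 - np.argmax(reach[:, ::-1], axis=1)
    return int(np.max(last_visible - np.arange(num_frames)))


if __name__ == "__main__":
    regular = regular_lookahead_mask(64, left=8, right=2)
    chunked = chunk_aware_mask(64, chunk=4, left_chunks=2)
    for depth in (1, 2, 4):
        print(depth,
              max_future_dependency(regular, depth),
              max_future_dependency(chunked, depth))
```

In this toy setting the regular mask's future dependency scales with the number of layers (a per-layer look-ahead of `right` frames becomes `layers * right`), while the chunk-aware mask never looks past the end of the current chunk, so its look-ahead stays at `chunk - 1` frames regardless of depth.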