Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference

Masao Someki, Nicholas Eng, Yosuke Higuchi, Shinji Watanabe

TL;DR

This work tackles the inference latency of autoregressive (AR) ASR by introducing a partially autoregressive (PAR) decoding framework that combines a greedy CTC (gCTC) initial hypothesis with segment-level vectorized beam search to refine low-confidence tokens. Tokens are selected for masking via the threshold $P_{\mathrm{thres}}$, and the number of refinement iterations is capped by $max\_iteration$, so PAR approaches the speed of non-autoregressive (NAR) methods while preserving AR-like accuracy. The approach uses a hybrid CTC/Attention model and a novel multi-mask beam search to parallelize decoding, giving speedups of up to 13.75× on LibriSpeech with LS-960 without any additional model training. However, PAR can incur higher memory usage and may degrade when the initial gCTC predictions are highly erroneous, so $P_{\mathrm{thres}}$ and $max\_iteration$ need careful tuning for deployment. Overall, PAR offers a compelling trade-off between inference speed and recognition accuracy, expanding the practical applicability of hybrid CTC/Attention ASR models.

Abstract

Attention-based encoder-decoder models with autoregressive (AR) decoding have proven to be the dominant approach for automatic speech recognition (ASR) due to their superior accuracy. However, they often suffer from slow inference. This is primarily attributed to the incremental calculation of the decoder. This work proposes a partially AR framework, which employs segment-level vectorized beam search for improving the inference speed of an ASR model based on the hybrid connectionist temporal classification (CTC) attention-based architecture. It first generates an initial hypothesis using greedy CTC decoding, identifying low-confidence tokens based on their output probabilities. We then utilize the decoder to perform segment-level vectorized beam search on these tokens, re-predicting in parallel with minimal decoder calculations. Experimental results show that our method is 12 to 13 times faster in inference on the LibriSpeech corpus over AR decoding whilst preserving high accuracy.
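The first stage described above (greedy CTC decoding, then flagging low-confidence tokens) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the blank index `BLANK = 0`, the toy posteriors, and the choice of the maximum frame probability as a token's confidence are all assumptions made for illustration. The decoder-side segment-level beam search is not shown; the sketch only produces the masked segments that such a search would re-predict in parallel.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def greedy_ctc_with_confidence(frame_probs):
    """Greedy CTC: argmax per frame, merge repeated labels, drop blanks.

    Returns (token_id, confidence) pairs. As an assumption, a token's
    confidence is the highest frame probability among its merged frames.
    """
    best = [(max(range(len(p)), key=p.__getitem__), max(p)) for p in frame_probs]
    tokens = []
    for tok, group in groupby(best, key=lambda item: item[0]):
        if tok != BLANK:
            tokens.append((tok, max(prob for _, prob in group)))
    return tokens


def mask_low_confidence(tokens, p_thres):
    """Mark tokens whose confidence is below the threshold P_thres."""
    return [conf < p_thres for _, conf in tokens]


def mask_segments(mask):
    """Group consecutive masked positions into half-open (start, end) segments."""
    segments, start = [], None
    for i, masked in enumerate(mask):
        if masked and start is None:
            start = i
        elif not masked and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(mask)))
    return segments


# Toy posteriors over a 3-symbol vocabulary (blank, token 1, token 2).
frame_probs = [
    [0.01, 0.99, 0.00],
    [0.02, 0.98, 0.00],
    [0.97, 0.02, 0.01],
    [0.30, 0.10, 0.60],
    [0.98, 0.01, 0.01],
    [0.05, 0.95, 0.00],
]
tokens = greedy_ctc_with_confidence(frame_probs)
mask = mask_low_confidence(tokens, p_thres=0.97)
segments = mask_segments(mask)
print(tokens, mask, segments)
```

In this toy case the second and third tokens fall below the threshold and are adjacent, so they form a single segment. Raising `p_thres` masks more tokens and therefore leaves more work for the decoder.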

Paper Structure

This paper contains 22 sections, 5 figures, 4 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Overview and comparison of AR, NAR, and PAR decoding. The figure marks the "start-of-sequence" symbol and shows mask tokens either as a dedicated symbol or as red characters. PAR is a hybrid of the AR and NAR methods, in which the masking process is applied first, followed by segment-level vectorized beam search.
  • Figure 2: Average inference time and proportion of time spent on the encoder, decoder, and CTC computation for (a) AR, as well as the encoder and decoder computation for (b) NAR architectures.
  • Figure 5: Average inference time and proportion of time spent on the encoder and decoder computation during PAR decoding. The decoder's share is greatly reduced relative to AR.
  • Figure 6: Comparison of WER and RTF for the AR and PAR methods, using models trained on the LS-100 and LS-960 datasets and varying the beam size from $1$ to $20$.
  • Figure 7: The relationship between WER and $P_{\mathrm{thres}}$, evaluated by varying $P_{\mathrm{thres}}$ from $0.95$ to $0.999$ with the E-Branchformer-based pre-trained model for LS-960.