Breaking Through the Spike: Spike Window Decoding for Accelerated and Precise Automatic Speech Recognition

Wei Zhang, Tian-Hao Zhang, Chao Luo, Hui Zhou, Chao Yang, Xinyuan Qian, Xu-Cheng Yin

TL;DR

This work tackles the latency of WFST-based decoding in CTC-based ASR by introducing Spike Window Decoding (SWD), which forms spike-centered windows around non-blank CTC spikes to drastically reduce the number of frames processed by the WFST. By combining a spike-aware windowing strategy with a weight-pushing optimization in the TLG graph, the method preserves recognition accuracy while significantly speeding up inference. Empirical results on AISHELL-1 and a large 43k-hour In-House Mandarin dataset show state-of-the-art CERs (approximately 3.89% on AISHELL-1 and 2.09% on In-House) and speedups up to around 2.17× over dense WFST decoding, confirming both efficacy and scalability. The findings also indicate that frames neighboring non-blank spikes carry meaningful information, supporting a new paradigm for integrating CTC outputs with WFSTs in fast, accurate ASR systems.

Abstract

Recently, end-to-end automatic speech recognition has become the mainstream approach in both industry and academia. To optimize system performance in specific scenarios, the Weighted Finite-State Transducer (WFST) is extensively used to integrate acoustic and language models, leveraging its capacity to implicitly fuse language models within static graphs, thereby ensuring robust recognition while also facilitating rapid error correction. However, WFST necessitates a frame-by-frame search of CTC posterior probabilities through autoregression, which significantly hampers inference speed. In this work, we thoroughly investigate the spike property of CTC outputs and further propose the conjecture that frames adjacent to non-blank spikes carry semantic information beneficial to the model. Building on this, we propose the Spike Window Decoding algorithm, which greatly improves inference speed by making the number of frames decoded in WFST linearly related to the number of spiking frames in the CTC output, while guaranteeing recognition performance. Our method achieves SOTA recognition accuracy with significantly accelerated decoding speed, proven across both AISHELL-1 and large-scale In-House datasets, establishing a pioneering approach for integrating CTC output with WFST.
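The frame selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `spike_window_frames`, the `left`/`right` window parameters, and the rule that a spike is any frame whose most probable label is not blank are all assumptions made for illustration. The WFST search itself is not shown.

```python
def spike_window_frames(posteriors, blank_id=0, left=1, right=1):
    """Return sorted indices of frames to pass on to WFST decoding.

    posteriors: list of per-frame probability lists (one entry per label).
    blank_id:   index of the CTC blank label (assumed to be 0 here).
    left/right: hypothetical window sizes, i.e. how many neighboring
                frames to keep on each side of a non-blank spike.
    """
    num_frames = len(posteriors)
    keep = set()
    for t, frame in enumerate(posteriors):
        # Assumption: a "spike" is a frame whose argmax label is non-blank.
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != blank_id:
            lo = max(0, t - left)
            hi = min(num_frames - 1, t + right)
            keep.update(range(lo, hi + 1))
    return sorted(keep)


# Toy CTC output: 8 frames, 3 labels (label 0 is blank).
# Non-blank spikes occur at frames 2 and 6.
posteriors = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.85, 0.05],
    [0.70, 0.20, 0.10],
    [0.95, 0.03, 0.02],
    [0.90, 0.05, 0.05],
    [0.05, 0.05, 0.90],
    [0.85, 0.10, 0.05],
]

kept = spike_window_frames(posteriors)
print(kept)  # [1, 2, 3, 5, 6, 7]
```

In this sketch the number of kept frames is at most `(left + right + 1)` times the number of spikes, which is one way to read the abstract's statement that the decoded frame count is linearly related to the number of spiking frames.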

Paper Structure (19 sections, 6 equations, 1 figure, 3 tables)

Figures (1)

  • Figure 1: (a) represents a search performed using the dense frames; (b) and (c) indicate 2 neighboring frames, while (d) indicates 1 neighboring frame. The dashed blue line signifies that only the left side is neighboring, whereas the red line denotes that only the right side is neighboring. The solid blue and red lines indicate that the left and right non-blank sides are neighboring simultaneously.