
SSCFormer: Push the Limit of Chunk-wise Conformer for Streaming ASR Using Sequentially Sampled Chunks and Chunked Causal Convolution

Fangyuan Wang, Bo Xu, Bo Xu

TL;DR

SSCFormer tackles streaming ASR by enabling long-range context and efficient training through Sequentially Sampled Chunks (SSC) and Chunked Causal Convolution (C2Conv). The encoder alternates Chunk-C2Conv and SSC-C2Conv blocks, which allows cross-chunk interactions while keeping MHSA complexity linear and preserving streaming latency. On AISHELL-1, it achieves a CER of 5.33% and outperforms both chunk-wise and some time-restricted baselines, while maintaining training and inference efficiency comparable to other streaming models. The approach broadens the applicability of streaming Conformers and may extend to architectures beyond SSCFormer.

Abstract

Chunk-wise schemes are often used to make Automatic Speech Recognition (ASR) models support streaming deployment. However, existing approaches are unable to capture global context, lack support for parallel training, or exhibit quadratic complexity in the computation of multi-head self-attention (MHSA). Meanwhile, causal convolution, which uses no future context, has become the de facto module in streaming Conformers. In this paper, we propose SSCFormer to push the limit of chunk-wise Conformer for streaming ASR using the following two techniques: 1) A novel cross-chunk context generation method, named the Sequential Sampling Chunk (SSC) scheme, which re-partitions regularly partitioned chunks to facilitate efficient long-term contextual interaction within local chunks. 2) Chunked Causal Convolution (C2Conv), which is designed to capture the left context and chunk-wise future context concurrently. Evaluations on AISHELL-1 show that an End-to-End (E2E) CER of 5.33% can be achieved, which even outperforms U2, a strong time-restricted baseline. Moreover, the chunk-wise MHSA computation in our model enables it to train with a large batch size and perform inference with linear complexity.
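
The linear-complexity claim can be checked with a simple count of query-key pairs. The following is a sketch under stated assumptions, not the paper's own derivation: a sequence of $L$ tokens is split into $L/W$ chunks of fixed size $W$ (the symbols used in the Figure 3 caption), attention is computed only inside each chunk, and heads and feature dimensions are ignored.

```latex
\underbrace{\frac{L}{W}}_{\text{number of chunks}} \cdot
\underbrace{W^{2}}_{\text{pairs per chunk}} = L\,W
\qquad \text{versus} \qquad
L^{2} \ \text{pairs for full-sequence MHSA.}
```

For a fixed chunk size $W$, the chunk-wise cost grows linearly in $L$. With $L = 12$ and $W = 4$, this gives $48$ pairs instead of $144$.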

Paper Structure (14 sections, 1 equation, 6 figures, 4 tables)

This paper contains 14 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: An illustrative example of the sequential sampling chunk partition scheme. In layer l, MHSA is computed in regular chunks. However, in the next layer (l+1), MHSA is computed in sequentially sampled chunks that enable cross-chunk interactions for tokens. $z_i$, $i \in \{0,1,\ldots,11\}$, denotes a speech token.
  • Figure 2: (a) Overview of the SSCFormer encoder; (b) two successive blocks, a Chunk-C2Conv Conformer block followed by an SSC-C2Conv Conformer block (N=6; see Section II.B for notation).
  • Figure 3: Illustration of (a) the efficient batch generation of sequentially sampled chunks and (b) the dynamic generation of attention mask for sequentially sampled chunks. In this example, the sequence length and chunk size are denoted by L and W, and are set to 12 and 4, respectively.
  • Figure 4: Attention masks of 3 chunks with chunk size 4 for (a) regular Chunk-MHSA, (b) SSC-MHSA, and (c) time-restricted MHSA.
  • Figure 5: Illustration of the computation process of Chunked Causal Convolution, where the kernel size of masked convolution is 3 and the chunk size is 4. The causal convolution can cross chunks, while the chunked convolution can attend to chunk-wise future tokens.
  • ...and 1 more figure
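
The index patterns described in the captions above can be made concrete with a small sketch, using the caption values L=12, W=4 and kernel size 3. Several parts are assumptions rather than details confirmed by the text here: the strided sampling pattern in `sampled_chunks`, the full left context in `time_restricted_mask`, and the centered kernel for the chunked view in `c2conv_receptive_fields`. All function names are hypothetical, and the paper's actual partition and masks may differ.

```python
def chunk_id(t, chunk_size):
    """Index of the regular chunk that token t belongs to."""
    return t // chunk_size

def regular_chunks(length, chunk_size):
    """Consecutive, non-overlapping chunks of token indices (Figure 1, layer l)."""
    return [list(range(s, min(s + chunk_size, length)))
            for s in range(0, length, chunk_size)]

def sampled_chunks(length, chunk_size):
    """ASSUMED re-partition: strided sampling across the whole sequence.

    Each sampled chunk still holds chunk_size tokens, but they are drawn
    from several regular chunks instead of one.
    """
    if length % chunk_size:
        raise ValueError("sketch assumes length is a multiple of chunk_size")
    stride = length // chunk_size  # equals the number of regular chunks
    return [list(range(k, length, stride)) for k in range(stride)]

def chunk_mask(length, chunk_size):
    """Figure 4(a)-style mask: a query attends only inside its own regular chunk."""
    return [[chunk_id(q, chunk_size) == chunk_id(k, chunk_size)
             for k in range(length)] for q in range(length)]

def time_restricted_mask(length, chunk_size):
    """Figure 4(c)-style mask, ASSUMING full left context: a query sees
    its own chunk and every earlier chunk."""
    return [[chunk_id(k, chunk_size) <= chunk_id(q, chunk_size)
             for k in range(length)] for q in range(length)]

def c2conv_receptive_fields(t, length, chunk_size, kernel_size):
    """Input indices visible at position t for the two views named in Figure 5.

    causal:  the last kernel_size positions up to t; may cross chunk borders.
    chunked: ASSUMED centered kernel, clipped to the chunk that contains t,
             so it can see future tokens only inside that chunk.
    """
    causal = [i for i in range(t - kernel_size + 1, t + 1) if i >= 0]
    half = kernel_size // 2
    lo = chunk_id(t, chunk_size) * chunk_size
    hi = min(lo + chunk_size, length)
    chunked = [i for i in range(t - half, t + half + 1) if lo <= i < hi]
    return causal, chunked

L, W = 12, 4
print(regular_chunks(L, W))   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
print(sampled_chunks(L, W))   # [[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]

# Number of allowed query-key pairs under each mask.
print(sum(map(sum, chunk_mask(L, W))))            # 48
print(sum(map(sum, time_restricted_mask(L, W))))  # 96

# First token of the second chunk, kernel size 3.
print(c2conv_receptive_fields(4, L, W, 3))        # ([2, 3, 4], [4, 5])
```

Under these assumptions, the chunk-wise mask allows 48 pairs against 96 for the time-restricted mask, which illustrates why chunk-wise attention scales linearly in sequence length while a growing left context does not. The last line shows the two convolution views at a chunk boundary: the causal view reaches back into the previous chunk, and the chunked view looks one token ahead without leaving the current chunk.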