UniEnc-CASSNAT: An Encoder-only Non-autoregressive ASR for Speech SSL Models

Ruchao Fan, Natarajan Balaji Shankar, Abeer Alwan

TL;DR

The paper targets efficient non-autoregressive ASR (NASR) by making encoder-only architectures able to leverage speech foundation models. It introduces UniEnc-CASSNAT, an encoder-only NASR model that mimics the CASS-NAT encoder-decoder dynamics through two forward passes and a token-level acoustic embedding (TAE) extraction step, so that a single encoder performs both frame-level representation learning and token-level context modeling. A multi-pass CTC (MP-CTC) training scheme with iterative decoding refines the TAEs and improves WER, achieving state-of-the-art NASR results on Librispeech-100h and MyST with fewer parameters than CASS-NAT. Inference is fast compared with autoregressive models while recognition performance stays competitive, which highlights the practical value of encoder-only designs initialized from speech foundation models. Future work includes further compression and distillation for on-device deployment.

Abstract

Non-autoregressive automatic speech recognition (NASR) models have gained attention due to their parallelism and fast inference. The encoder-based NASR, e.g. connectionist temporal classification (CTC), can be initialized from the speech foundation models (SFM) but does not account for any dependencies among intermediate tokens. The encoder-decoder-based NASR, like CTC alignment-based single-step non-autoregressive transformer (CASS-NAT), can mitigate the dependency problem but is not able to efficiently integrate SFM. Inspired by the success of recent work of speech-text joint pre-training with a shared transformer encoder, we propose a new encoder-based NASR, UniEnc-CASSNAT, to combine the advantages of CTC and CASS-NAT. UniEnc-CASSNAT consists of only an encoder as the major module, which can be the SFM. The encoder plays the role of both the CASS-NAT encoder and decoder by two forward passes. The first pass of the encoder accepts the speech signal as input, while the concatenation of the speech signal and the token-level acoustic embedding is used as the input for the second pass. Examined on the Librispeech 100h, MyST, and Aishell1 datasets, the proposed UniEnc-CASSNAT achieves state-of-the-art NASR results and is better or comparable to CASS-NAT with only an encoder and hence, fewer model parameters. Our codes are publicly available.
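
The remark that CTC "does not account for any dependencies among intermediate tokens" can be illustrated with standard greedy CTC decoding, where each frame's label is picked independently before repeats are merged and blanks are dropped. The following is a minimal sketch of that generic procedure, not code from the paper; the function name `ctc_greedy_decode`, the blank index 0, and the toy scores are assumptions made for illustration.

```python
from itertools import groupby


def ctc_greedy_decode(frame_scores, blank=0):
    """Greedy CTC decoding over a list of per-frame score lists.

    Each frame is labelled by its own argmax, with no look at neighbouring
    frames or at previously emitted tokens.
    """
    # Independent per-frame decision (the source of the missing token dependencies).
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    # Merge consecutive repeats, then drop the blank label.
    collapsed = [label for label, _ in groupby(best)]
    return [label for label in collapsed if label != blank]


# Toy example: 9 frames, 3 labels (0 = blank), one-hot scores along a fixed path.
path = [0, 1, 1, 0, 2, 2, 2, 0, 2]
scores = [[1.0 if k == p else 0.0 for k in range(3)] for p in path]
print(ctc_greedy_decode(scores))  # -> [1, 2, 2]
```

Because every frame is decided in isolation, nothing in this procedure lets one output token influence another, which is the gap the second encoder pass over token-level acoustic embeddings is meant to close.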

Paper Structure

This paper contains 12 sections, 3 equations, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: (a) Diagram of CASS-NAT. (b) The proposed UniEnc-CASSNAT, which uses the HuBERT convolutional and contextual encoders. The TAE extractor is a self-attention module that transforms the acoustic representations of length T into TAEs of length U. TAE generation and the second forward pass are repeated during iterative decoding.
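
The shape flow described in the caption (T frames reduced to U TAEs, concatenation, and a repeated second pass) can be sketched with NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_encoder` is a single linear layer standing in for the shared contextual encoder, `queries` is a hypothetical set of U query vectors, the extractor is plain scaled dot-product attention, and refreshing the TAEs from the latest frame-level outputs is one possible reading of "repeated during iterative decoding".

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def toy_encoder(x, w):
    """Stand-in for the shared encoder: one linear layer with tanh (assumption)."""
    return np.tanh(x @ w)


def extract_taes(frames, queries):
    """Attend from U queries to T frames: (U, d) and (T, d) -> (U, d)."""
    d = frames.shape[-1]
    weights = softmax(queries @ frames.T / np.sqrt(d), axis=-1)  # (U, T)
    return weights @ frames


def two_pass(speech, w, queries, n_iters=2):
    """First pass on speech only, then n_iters >= 1 passes on [speech; TAEs]."""
    T = speech.shape[0]
    frames = toy_encoder(speech, w)                     # first pass: (T, d)
    for _ in range(n_iters):
        taes = extract_taes(frames, queries)            # (U, d)
        joint = np.concatenate([speech, taes], axis=0)  # (T + U, d)
        out = toy_encoder(joint, w)                     # second pass, same weights
        frames, token_states = out[:T], out[T:]         # split back into T and U parts
    return token_states                                 # (U, d), one state per token


rng = np.random.default_rng(0)
T, U, d = 6, 3, 4
tokens = two_pass(rng.normal(size=(T, d)), rng.normal(size=(d, d)), rng.normal(size=(U, d)))
print(tokens.shape)  # -> (3, 4)
```

The point of the sketch is only that the same weights `w` serve both passes and that the output length changes from T to U; a real model would replace the toy layer with the HuBERT encoder and add output projections and CTC losses.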