End-to-end Speech Recognition with similar length speech and text

Peng Fan, Wenping Wang, Fei Deng

TL;DR

The paper tackles end-to-end ASR when speech length is similar to text length by downsampling speech using a key-frame mechanism (KFDS) guided by intermediate CTC. It introduces Length Similarity Loss (LSL) with two implementations, Time Independence Loss (TIL) and Aligned Cross Entropy (AXE), plus frame fusion to preserve surrounding context. A three-step training schedule stabilizes learning and improves alignment. Experiments on AISHELL-1 and AISHELL-2 show competitive CER while reducing frames by about 87%, demonstrating substantial computational savings in length-matched ASR without sacrificing accuracy.

Abstract

The mismatch between speech length and text length poses a challenge in automatic speech recognition (ASR). Previous research has employed various approaches to align text with speech, including Connectionist Temporal Classification (CTC). Earlier work introduced a key frame mechanism (KFDS) that uses intermediate CTC outputs to guide downsampling and preserve keyframes; however, conventional CTC fails to align speech and text properly when speech is downsampled to a length similar to that of the text. In this paper, we focus on speech recognition in cases where the speech length closely matches that of the corresponding text. To address this alignment problem, we introduce two methods: a) Time Independence Loss (TIL) and b) Aligned Cross Entropy (AXE) loss, which is based on edit distance. To enrich the information carried by keyframes, we incorporate frame fusion, which weights and sums each keyframe with its two context frames. Experimental results on subsets of the AISHELL-1 and AISHELL-2 datasets show that the proposed methods outperform the previous work and reduce the number of frames by at least 86%.
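To make the downsampling and fusion steps in the abstract more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes that a key frame is the first frame of each run of identical non-blank labels in the greedy intermediate CTC output, that the "two context frames" are the immediate left and right neighbours, and that fusion uses fixed weights. The function names `select_key_frames` and `fuse_frames`, the `blank_id` default, and the weight values are all hypothetical; the abstract does not specify how the weights are chosen.

```python
import numpy as np


def select_key_frames(ctc_log_probs, blank_id=0):
    """Pick key-frame indices from intermediate CTC outputs.

    Assumption (not confirmed by the source): a key frame is the first
    frame of each run of identical non-blank greedy CTC labels.
    ctc_log_probs has shape (T, vocab_size).
    """
    labels = ctc_log_probs.argmax(axis=-1)            # greedy label per frame, shape (T,)
    prev = np.concatenate(([blank_id], labels[:-1]))  # label of the preceding frame
    keep = (labels != blank_id) & (labels != prev)    # non-blank and start of a new run
    return np.flatnonzero(keep)


def fuse_frames(features, key_idx, weights=(0.25, 0.5, 0.25)):
    """Weighted sum of each key frame with its left and right neighbour.

    Assumption: fixed illustrative weights; neighbours at the sequence
    boundary are clamped to the first/last frame.
    features has shape (T, D); the result has shape (len(key_idx), D).
    """
    num_frames = features.shape[0]
    left = np.clip(key_idx - 1, 0, num_frames - 1)
    right = np.clip(key_idx + 1, 0, num_frames - 1)
    w_left, w_center, w_right = weights
    return (w_left * features[left]
            + w_center * features[key_idx]
            + w_right * features[right])
```

Under these assumptions, a sequence of T encoder frames shrinks to roughly one frame per emitted token, so the fraction of frames removed is `1 - len(key_idx) / T`. Fusion lets each retained frame keep some information from the neighbours that are dropped.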

Paper Structure

This paper contains 15 sections, 9 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The overall architecture of (a) the vanilla Conformer-based AED model, (b) the AED model with intermediate CTC, and (c) the proposed model, in which speech is downsampled with the KFDS-based mechanism to a length similar to that of the text.
  • Figure 2: Attention-based frame fusion.
  • Figure 3: Concatenate-based frame fusion.
  • Figure 4: Three-step training.