Echotune: A Modular Extractor Leveraging the Variable-Length Nature of Speech in ASR Tasks

Sizhou Chen, Songyang Gao, Sen Fang

TL;DR

This paper addresses the limitation of fixed-length attention in Transformer-based ASR by introducing Echo-MSA, a modular variable-length, multi-scale attention mechanism. Echo-MSA, implemented within Echo-Transformer blocks and fused with standard attention via a Dual Focus Gate, enables modeling speech at multiple granularities while reducing computation. A compound loss combining CTC with a weighting mechanism enhances training, and results on LibriSpeech show significant WER reductions, especially in 100h and low-resource scenarios, with robust gains across kernel sizes. The work demonstrates practical improvements in ASR performance and stability, suggesting broad applicability to pre-trained backbones like data2vec and potential extension to larger datasets.

Abstract

The Transformer architecture has proven highly effective for Automatic Speech Recognition (ASR) and has become a foundational component of much research in the domain. Historically, many approaches have relied on fixed-length attention windows, which become problematic for speech samples that vary in duration and complexity, leading to over-smoothing of the data and neglect of essential long-term connectivity. To address this limitation, we introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism that accommodates a range of speech sample complexities and durations. This module offers the flexibility to extract speech features at various granularities, from frames and phonemes to words and discourse. The proposed design captures the variable-length nature of speech and addresses the limitations of fixed-length attention. Our evaluation uses a parallel attention architecture complemented by a dynamic gating mechanism that combines traditional attention with the Echo-MSA module output. Empirical evidence from our study shows that integrating Echo-MSA into the primary model's training regime significantly improves word error rate (WER), while preserving the intrinsic stability of the original model.
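
The parallel-attention-plus-gate idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`windowed_attention`, `gated_fusion`), the scalar sigmoid gate, and the `gate_weights` parameter are all assumptions made for illustration, and the frames serve directly as queries, keys, and values with no learned projections. The sketch only shows how a restricted-window attention output might be blended with a full-context attention output, frame by frame.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def windowed_attention(frames, window):
    """Self-attention in which frame t attends only to frames within
    `window` steps on either side (hypothetical helper, no projections)."""
    dim = len(frames[0])
    out = []
    for t, query in enumerate(frames):
        lo, hi = max(0, t - window), min(len(frames), t + window + 1)
        keys = frames[lo:hi]
        weights = softmax([dot(query, k) / math.sqrt(dim) for k in keys])
        out.append([sum(w * k[d] for w, k in zip(weights, keys))
                    for d in range(dim)])
    return out

def gated_fusion(global_out, local_out, gate_weights, gate_bias=0.0):
    """Assumed gate form: g = sigmoid(w . [global; local] + b),
    output = g * global + (1 - g) * local, computed per frame."""
    fused = []
    for g_vec, l_vec in zip(global_out, local_out):
        z = dot(gate_weights, g_vec + l_vec) + gate_bias
        g = 1.0 / (1.0 + math.exp(-z))
        fused.append([g * a + (1.0 - g) * b for a, b in zip(g_vec, l_vec)])
    return fused

# Toy input: four 2-dimensional "frames".
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
global_out = windowed_attention(frames, window=len(frames))  # full context
local_out = windowed_attention(frames, window=1)             # short window
# Zero gate weights give g = 0.5, i.e. a plain average of the two branches.
fused = gated_fusion(global_out, local_out, gate_weights=[0.0] * 4)
```

In a trained model the gate weights would be learned, so the blend could differ per frame; the window width plays the role of the customizable variable length, and several widths could be run in parallel to cover multiple scales.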

Paper Structure

This paper contains 15 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Hierarchical Echo-Transformer Training Framework with Multi-Stage Processing.
  • Figure 2: Embedding Echo-MSA with Variable-Length Multi-Scale Attention into Pretrained Models Assisted by Dual Focus Gate at Time Step $\tau$, where $W_{\phi}$ Represents Customizable Variable Length.
  • Figure 3: Word Error Rate (WER) on LibriSpeech dev-clean: Robustness of Our Model with Different Kernel Sizes for 1h Labeled Data.