Regularizing Learnable Feature Extraction for Automatic Speech Recognition

Peter Vieting, Maximilian Kannen, Benedikt Hilmes, Ralf Schlüter, Hermann Ney

TL;DR

The paper tackles the overfitting of learnable feature extractors in ASR by introducing regularization via targeted audio perturbations and a novel STFT-domain SpecAugment. Using a Switchboard 311h setup with a conformer-based CTC model, it demonstrates that tempo-based perturbations and STFT-domain masking before feature extraction substantially improve generalization, reducing reliance on hand-crafted features. The combination of both techniques closes the performance gap to traditional log Mel features, achieving near parity on Hub5'00/Hub5'01 with far less overfitting. This work provides practical, data-efficient guidance for deploying end-to-end ASR systems with learnable front-ends in low-resource settings.

Abstract

Neural front-ends are an appealing alternative to traditional, fixed feature extraction pipelines for automatic speech recognition (ASR) systems since they can be directly trained to fit the acoustic model. However, their performance often falls short compared to classical methods, which we show is largely due to their increased susceptibility to overfitting. This work therefore investigates regularization methods for training ASR models with learnable feature extraction front-ends. First, we examine audio perturbation methods and show that larger relative improvements can be obtained for learnable features. Additionally, we identify two limitations in the standard use of SpecAugment for these front-ends and propose masking in the short time Fourier transform (STFT)-domain as a simple but effective modification to address these challenges. Finally, integrating both regularization approaches effectively closes the performance gap between traditional and learnable features.
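As a rough illustration of what masking in the STFT domain could look like when it is applied to the raw waveform before a learnable front-end, the following is a minimal NumPy sketch, not the authors' implementation. The function names (`stft`, `istft`, `mask_stft`, `stft_domain_masking`), the window and hop sizes, the single time and frequency mask, and the maximum mask widths are all assumptions chosen for illustration. The sketch also assumes that the masked STFT is converted back to a waveform so that it can be fed to a front-end operating on samples.

```python
import numpy as np


def stft(x, n_fft=256, hop=64):
    """Framed, Hann-windowed FFT. Returns an array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft=256, hop=64):
    """Weighted overlap-add inverse of `stft` (same window for synthesis)."""
    window = np.hanning(n_fft + 1)[:-1]
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i in range(n_frames):
        out[i * hop : i * hop + n_fft] += frames[i]
        norm[i * hop : i * hop + n_fft] += window**2
    # Avoid division by zero at the edges, where the window is (close to) zero.
    return out / np.maximum(norm, 1e-8)


def mask_stft(spec, rng, max_time_width=10, max_freq_width=8):
    """Zero one random block of frames and one random block of frequency bins.

    The mask widths are illustrative assumptions, not values from the paper.
    """
    spec = spec.copy()
    n_frames, n_bins = spec.shape

    t_width = int(rng.integers(0, max_time_width + 1))
    t_start = int(rng.integers(0, max(1, n_frames - t_width + 1)))
    spec[t_start : t_start + t_width, :] = 0

    f_width = int(rng.integers(0, max_freq_width + 1))
    f_start = int(rng.integers(0, max(1, n_bins - f_width + 1)))
    spec[:, f_start : f_start + f_width] = 0
    return spec


def stft_domain_masking(wave, rng, n_fft=256, hop=64):
    """Waveform -> STFT -> mask -> inverse STFT -> augmented waveform."""
    spec = stft(wave, n_fft=n_fft, hop=hop)
    masked = mask_stft(spec, rng)
    return istft(masked, n_fft=n_fft, hop=hop)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    wave = rng.standard_normal(4000)  # stand-in for an audio signal
    augmented = stft_domain_masking(wave, rng)
    print(wave.shape, augmented.shape)
```

In this sketch the augmented signal is slightly shorter than the input because trailing samples that do not fill a complete frame are dropped; a practical version would pad the input instead. The augmented waveform would then be passed to the learnable feature extractor during training only.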

Paper Structure

This paper contains 15 sections, 3 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Train and dev connectionist temporal classification (CTC) scores for the baseline training with log Mel and supervised convolutional features (SCF), respectively, demonstrating the overfitting issue for the latter.
  • Figure 2: Baseline vs. proposed masking strategies.