Towards Advanced Speech Signal Processing: A Statistical Perspective on Convolution-Based Architectures and its Applications

Nirmal Joshua Kapu, Raghav Karan

TL;DR

The paper addresses improving speech signal processing by analyzing convolution-based models from a statistical perspective. It surveys CNNs, Conformers, ResNets, and CRNNs and examines training cost, model size, accuracy, and speed on VoxForge and VoxLingua6, connecting theory to practical ASR, speaker identification, and emotion detection. Conformers achieve the lowest Dev-set error with $WER=5.27\%$ on VoxLingua6, while CNNs provide the most resource-efficient options. The findings guide deployment choices and motivate future work on noise robustness, low-latency designs, and hybrid convolution/self-supervised approaches.

Abstract

This article surveys convolution-based models, including convolutional neural networks (CNNs), Conformers, ResNets, and CRNNs, as speech signal processing models, and provides their statistical background along with their applications in speech recognition, speaker identification, emotion recognition, and speech enhancement. Through a comparative assessment of training cost, model size, accuracy, and speed, we compare the strengths and weaknesses of each model, identify potential errors, and propose avenues for further research, emphasizing the central role these models play in advancing applications of speech technologies.
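
For readers unfamiliar with the accuracy metric quoted in the TL;DR, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and a model's hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition only. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and the paper's own scoring setup may differ (for example in text normalization).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenization is a plain
    whitespace split, which is an assumption, not the paper's procedure.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER = {wer:.2%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.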

Paper Structure

This paper contains 18 sections, 41 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Convolution-Based Architectures
  • Figure 2: Illustration of a CNN architecture in speech processing.
  • Figure 3: Conformer model architecture
  • Figure 4: ResNet architecture proposed in b31
  • Figure 5: Speech Signal Processing Pipeline
  • ...and 3 more figures