Robust Persian Digit Recognition in Noisy Environments Using Hybrid CNN-BiGRU Model

Ali Nasr-Esfahani, Mehdi Bekrani, Roozbeh Rajabi

TL;DR

This work tackles robust recognition of isolated spoken Persian digits in noisy environments. It introduces a hybrid CNN-BiGRU network that uses word units and MFCC features, with data augmentation to simulate diverse acoustic conditions. The model, inspired by DeepSpeech2 but built from residual CNN blocks and BiGRU layers, reaches accuracies of $98.53\%$ (training), $96.10\%$ (validation), and $95.92\%$ (test). Under noise, it improves recognition by $26.88\%$ over a phoneme-based LSTM baseline and by $7.61\%$ over an MTDRCC+MLP baseline. The findings indicate strong noise resilience and speaker independence, offering a practical solution for Persian speech interfaces and related applications.

Abstract

Artificial intelligence (AI) has significantly advanced speech recognition applications. However, many existing neural network-based methods struggle with noise, which reduces their accuracy in real-world environments. This study addresses isolated spoken Persian digit recognition (zero to nine) under noisy conditions, particularly for phonetically similar numbers. A hybrid model combining residual convolutional neural networks and bidirectional gated recurrent units (BiGRU) is proposed, using word units instead of phoneme units for speaker-independent recognition. The FARSDIGIT1 dataset, augmented with various approaches, is processed using Mel-Frequency Cepstral Coefficients (MFCC) for feature extraction. Experimental results demonstrate the model's effectiveness: it achieves 98.53%, 96.10%, and 95.92% accuracy on the training, validation, and test sets, respectively. In noisy conditions, the proposed approach improves recognition by 26.88% over phoneme-unit-based LSTM models and by 7.61% over the Mel-scale Two Dimension Root Cepstrum Coefficients (MTDRCC) feature extraction technique combined with an MLP model (MTDRCC+MLP).
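
The abstract does not say which augmentations were applied to FARSDIGIT1. One common way to simulate noisy conditions is to add a noise recording to a clean utterance at a chosen signal-to-noise ratio (SNR). The sketch below illustrates that idea only and is not the authors' pipeline. The function names, the 16 kHz sampling rate, the synthetic tone standing in for a recorded digit, and the Gaussian noise source are all assumptions made for illustration.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so that the mixture has the requested SNR (dB)."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    # Average power of each signal (mean of squared samples).
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise signal is silent")
    # SNR_dB = 10 * log10(p_clean / (gain^2 * p_noise)), solved for gain.
    gain = math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [x + gain * n for x, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """SNR of `noisy` relative to `clean`, computed from the residual noisy - clean."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_resid = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(p_clean / p_resid)


# Hypothetical stand-in for a one-second digit recording: a 440 Hz tone at 16 kHz.
rng = random.Random(0)  # fixed seed so the example is reproducible
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

noisy = mix_at_snr(clean, noise, snr_db=10.0)
print(round(measured_snr_db(clean, noisy), 2))
```

Sweeping `snr_db` over a range of values would produce several noisy copies of each utterance, which is one way to expose a model to varied acoustic conditions before MFCC extraction.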

Paper Structure

This paper contains 11 sections, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: Input data count after data augmentation.
  • Figure 2: MFCC block diagram [dave2013feature].
  • Figure 3: Block diagram of the proposed DNN.
  • Figure 4: Frequency chart of each class in the training data.
  • Figure 5: Frequency chart of each class in the validation data.
  • ...and 3 more figures