Robust Persian Digit Recognition in Noisy Environments Using Hybrid CNN-BiGRU Model
Ali Nasr-Esfahani, Mehdi Bekrani, Roozbeh Rajabi
TL;DR
This work tackles robust isolated Persian digit recognition in noisy environments. It introduces a hybrid CNN-BiGRU network that uses word units and MFCC features, with data augmentation to simulate diverse acoustic conditions. The model, inspired by DeepSpeech2 but modified with residual CNN blocks and BiGRU layers, achieves 98.53% training, 96.10% validation, and 95.92% test accuracy. Under noise, it outperforms a phoneme-based LSTM baseline by 26.88% and an MTDRCC+MLP baseline by 7.61%. The findings demonstrate strong noise resilience and speaker independence, offering a practical solution for Persian speech interfaces and related applications.
Abstract
Artificial intelligence (AI) has significantly advanced speech recognition applications. However, many existing neural network-based methods struggle with noise, which reduces their accuracy in real-world environments. This study addresses isolated spoken Persian digit recognition (zero to nine) under noisy conditions, particularly for phonetically similar numbers. A hybrid model combining residual convolutional neural networks and bidirectional gated recurrent units (BiGRU) is proposed, using word units instead of phoneme units for speaker-independent recognition. The FARSDIGIT1 dataset is augmented with various approaches, and Mel-Frequency Cepstral Coefficients (MFCC) are extracted as input features. Experimental results demonstrate the model's effectiveness: it achieves 98.53%, 96.10%, and 95.92% accuracy on the training, validation, and test sets, respectively. In noisy conditions, the proposed approach improves recognition by 26.88% over phoneme unit-based LSTM models and by 7.61% over an MLP model that uses Mel-scale Two Dimension Root Cepstrum Coefficients as features (MTDRCC+MLP).
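The abstract does not say which augmentation approaches were applied to FARSDIGIT1. As an illustration only, one common way to simulate noisy conditions is to mix a noise signal into a clean utterance at a chosen signal-to-noise ratio (SNR). The sketch below assumes this technique; the function names `mix_at_snr` and `measured_snr_db`, the synthetic sine "utterance", the Gaussian noise, and the 16 kHz sample count are all hypothetical and are not taken from the paper.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Add noise to a clean signal, scaled so the mixture has the given SNR in dB.

    Illustrative helper (an assumption, not the authors' augmentation code).
    """
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise has zero power")
    # Required noise power is p_clean / 10^(snr_db / 10); scale amplitudes accordingly.
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [x + scale * n for x, n in zip(clean, noise)]


def measured_snr_db(clean, noisy):
    """Compute the SNR in dB of a noisy signal relative to its clean reference."""
    residual = [y - x for x, y in zip(clean, noisy)]
    p_clean = sum(x * x for x in clean) / len(clean)
    p_residual = sum(r * r for r in residual) / len(residual)
    return 10 * math.log10(p_clean / p_residual)


# Synthetic stand-ins: a 440 Hz tone as the "utterance" and Gaussian noise.
rng = random.Random(0)
num_samples = 16000  # hypothetical: one second at an assumed 16 kHz rate
clean = [math.sin(2 * math.pi * 440 * t / num_samples) for t in range(num_samples)]
noise = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

noisy = mix_at_snr(clean, noise, snr_db=10.0)
print(round(measured_snr_db(clean, noisy), 3))
```

Repeating the mix over a range of SNR values and noise types would yield several noisy variants per recording, which is the general idea behind simulating diverse acoustic conditions; the specific noise sources and SNR levels used in the study are not stated here.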
