Enhancing Neural Spoken Language Recognition: An Exploration with Multilingual Datasets

Or Haim Anidjar, Roi Yozevitch

TL;DR

This work advances multilingual spoken language recognition by extending x-vector architectures with a temporal pooling strategy and a funnel-shaped TDNN. It systematically optimizes TDNN hyperparameters through grid search, augments data with speed/pitch/noise variations, and introduces 1x1 TDNN layers to enhance local feature processing. Trained on ten languages from Indo-European, Semitic, and East Asian families using Common Voice, the approach achieves near-state-of-the-art accuracy (approaching 97%) with improved efficiency. The findings underscore the practical potential for robust, scalable multilingual SLR in real-world applications, and point to future work on unverified data and speaker separation in overlapping speech.

Abstract

In this research, we advanced a spoken language recognition system, moving beyond traditional feature vector-based models. Our improvements focused on capturing language characteristics over extended time spans using a specialized pooling layer. We used a broad range of data from Common Voice, targeting ten languages across the Indo-European, Semitic, and East Asian families. The major innovation was optimizing the architecture of Time Delay Neural Networks (TDNNs): we introduced additional layers and restructured the networks into a funnel shape, enhancing their ability to process complex linguistic patterns. A rigorous grid search determined the optimal settings for these networks, significantly boosting their efficiency in recognizing language patterns from audio samples. The model underwent extensive training, including a phase with augmented data, to refine its capabilities. The result is a highly accurate system that achieves a 97% accuracy rate in language recognition. This advancement contributes to artificial intelligence by improving the accuracy and efficiency of language processing systems, a critical aspect in the engineering of advanced speech recognition technologies.
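
To make the idea of pooling over time concrete, the following is a minimal sketch of statistics pooling as it is commonly used in x-vector systems: frame-level features of any length are collapsed into a single fixed-size vector of per-dimension means and standard deviations. This assumes the standard mean-plus-standard-deviation formulation; the summary above does not spell out the exact pooling layer used, and the function name `statistics_pooling` is illustrative only.

```python
import math
from typing import List, Sequence


def statistics_pooling(frames: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a (T x D) sequence of frame-level features into one 2*D vector.

    Returns the per-dimension mean followed by the per-dimension
    (population) standard deviation, both computed over the time axis.
    Assumption: standard x-vector statistics pooling, not necessarily
    the exact layer used in the paper.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / num_frames)
        for d in range(dim)
    ]
    return means + stds


# Utterances of different lengths map to vectors of the same size.
short_utt = [[1.0, 2.0], [3.0, 4.0]]
long_utt = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0], [6.0, 1.0]]
short_vec = statistics_pooling(short_utt)
long_vec = statistics_pooling(long_utt)
```

Because the output size depends only on the feature dimension and not on the number of frames, the layers that follow the pooling step can classify utterances of arbitrary duration.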

Paper Structure (28 sections, 4 figures, 7 tables)

This paper contains 28 sections, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Architecture of the source (baseline) model, based on github.com/KrishnaDN/x-vector-pytorch.
  • Figure 2: Network topology: The pipeline starts from an mp3 file, which is converted to wav format during data preprocessing. Augmentations (pitch shift, speed modulation, and Gaussian noise addition) are then applied before the audio is fed into the TDNN model, with training repeated inside a grid search loop.
  • Figure 3: Confusion Matrix on Validation Dataset for the Final Model, showing minimal error.
  • Figure 4: Confusion Matrix on Test Dataset for the Final Model, indicating potential overfitting compared to validation results.
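
The pipeline described in the Figure 2 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: speed modulation is approximated by linear-interpolation resampling, pitch shifting is omitted, the hyperparameter names `context_width` and `hidden_units` are hypothetical, and the `evaluate` callback is a toy stand-in for training a TDNN and returning its validation accuracy.

```python
import itertools
import random
from typing import Callable, Dict, List, Sequence, Tuple


def change_speed(wave: Sequence[float], factor: float) -> List[float]:
    """Resample by linear interpolation; factor > 1 shortens (speeds up) the clip."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not wave:
        return []
    out_len = max(1, int(len(wave) / factor))
    out = []
    for i in range(out_len):
        pos = i * factor
        lo = min(int(pos), len(wave) - 1)
        hi = min(lo + 1, len(wave) - 1)
        frac = pos - lo
        out.append(wave[lo] * (1.0 - frac) + wave[hi] * frac)
    return out


def add_gaussian_noise(
    wave: Sequence[float], sigma: float, rng: random.Random
) -> List[float]:
    """Add zero-mean Gaussian noise with standard deviation sigma to every sample."""
    return [x + rng.gauss(0.0, sigma) for x in wave]


def grid_search(
    grid: Dict[str, Sequence],
    evaluate: Callable[[Dict], float],
) -> Tuple[Dict, float]:
    """Try every combination in the grid and keep the highest-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Augmentation on a toy waveform.
wave = [0.0, 1.0, 2.0, 3.0]
fast = change_speed(wave, 2.0)
slow = change_speed(wave, 0.5)
noisy = add_gaussian_noise(wave, 0.01, random.Random(0))

# Hypothetical hyperparameter grid; the scoring function is a toy placeholder
# for "train the model with these settings and return validation accuracy".
grid = {"context_width": [3, 5], "hidden_units": [256, 512]}


def toy_score(p: Dict) -> float:
    return -abs(p["context_width"] - 5) - abs(p["hidden_units"] - 256) / 256


best_params, best_score = grid_search(grid, toy_score)
```

In a real setup, the augmented copies would be added to the training set and `evaluate` would wrap a full training run, so the cost of the search grows with the product of the grid sizes.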