Is Attention always needed? A Case Study on Language Identification from Speech

Atanu Mandal, Santanu Pal, Indranil Dutta, Mahidas Bhattacharya, Sudip Kumar Naskar

TL;DR

This study investigates language identification from speech using three architectures—CNN, CRNN, and CRNN with Attention—on MFCC features for 13 Indian languages. The CRNN-based models outperform CNN, achieving up to about 0.987 accuracy on the Indian dataset, with strong robustness to noise and clear extensibility to additional languages. Ablation results show kernel size and data weighting significantly impact performance, while Attention yields limited gains relative to its computational cost. The work also demonstrates competitive results on a European-language dataset, highlighting the method's generalizability and practical relevance for multilingual ASR systems.

Abstract

Language Identification (LID) is a crucial preliminary step in Automatic Speech Recognition (ASR): identifying the spoken language from audio samples. Contemporary systems that can process speech in multiple languages require users to explicitly designate one or more languages before use. LID becomes important in multilingual settings where an ASR system cannot comprehend the spoken language, which leads to failed speech recognition. The present study introduces a convolutional recurrent neural network (CRNN) based LID model that operates on Mel-frequency Cepstral Coefficient (MFCC) features of audio samples. Furthermore, we replicate certain state-of-the-art methodologies, specifically the Convolutional Neural Network (CNN) and the Attention-based Convolutional Recurrent Neural Network (CRNN with attention), and compare them with our CRNN-based approach. We conducted comprehensive evaluations on thirteen distinct Indian languages, and our model achieved over 98% classification accuracy. The LID model achieves accuracy ranging from 97% to 100% for languages that are linguistically similar. The proposed LID model is highly extensible to additional languages and is robust to noise, achieving 91.2% accuracy in a noisy setting when applied to a European Language (EU) dataset.

Paper Structure

This paper contains 20 sections, 15 equations, 3 figures, 28 tables.

Figures (3)

  • Figure 1: Our CRNN framework, consisting of a Convolution Block and an LSTM Block. The convolution block extracts features from the input audio. The output of the final convolution layer is fed to the Bi-Directional LSTM network, which is in turn connected to a Linear Layer with a softmax classifier.
  • Figure 2: Schematic diagram of the Attention Module.
  • Figure 3: Comparison of model results for varying dataset size.
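
The pipeline described in the Figure 1 caption (convolution block, then bi-directional LSTM, then linear layer with softmax) can be sketched roughly as follows in PyTorch. This is an illustrative sketch only, not the authors' implementation: the class name `CRNNSketch`, the single convolution layer, the channel count, kernel size, pooling, hidden size, and the choice of 13 MFCC coefficients are all assumptions, since the page does not give these details.

```python
import torch
import torch.nn as nn


class CRNNSketch(nn.Module):
    """Hypothetical CRNN: convolution block -> Bi-LSTM -> linear -> softmax."""

    def __init__(self, n_mfcc=13, n_languages=13, conv_channels=32, lstm_hidden=64):
        super().__init__()
        # Convolution block (assumed sizes): treats the MFCC matrix as a 1-channel image.
        self.conv = nn.Sequential(
            nn.Conv2d(1, conv_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(conv_channels),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
        )
        # Pooling halves the MFCC axis (floor division), so each time step
        # carries conv_channels * (n_mfcc // 2) features into the LSTM.
        lstm_input = conv_channels * (n_mfcc // 2)
        self.lstm = nn.LSTM(lstm_input, lstm_hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, n_languages)

    def forward(self, mfcc):
        # mfcc: (batch, n_mfcc, frames)
        x = self.conv(mfcc.unsqueeze(1))       # (batch, C, n_mfcc // 2, frames // 2)
        x = x.permute(0, 3, 1, 2).flatten(2)   # (batch, frames // 2, C * (n_mfcc // 2))
        _, (h_n, _) = self.lstm(x)             # h_n: (2, batch, lstm_hidden)
        # Concatenate the final forward and backward hidden states.
        summary = torch.cat([h_n[0], h_n[1]], dim=1)
        return torch.softmax(self.classifier(summary), dim=1)


model = CRNNSketch()
model.eval()
with torch.no_grad():
    probs = model(torch.randn(4, 13, 100))  # 4 random clips, 13 MFCCs, 100 frames
```

Each row of `probs` is a probability distribution over the 13 language classes. For training, one would typically return the raw logits instead and use `nn.CrossEntropyLoss`, which applies the softmax internally.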