Artificial Neural Networks to Recognize Speakers Division from Continuous Bengali Speech

Hasmot Ali, Md. Fahad Hossain, Md. Mehedi Hasan, Sheikh Abujar, Sheak Rashed Haider Noori

TL;DR

The paper addresses recognizing a speaker's geographical division from continuous Bengali speech by extracting MFCC features and delta dynamics, then classifying with a shallow Artificial Neural Network. Using a dataset of over 45 hours from 633 speakers across eight divisions, the approach achieves up to 85.44% validation accuracy. The study demonstrates the feasibility of region-aware speaker analysis in Bangla and contributes to Bangla NLP and ASR applications, with potential use in security and forensics. The combination of MFCC features, delta features, and a compact ANN provides a practical, low-complexity solution for Bengali division classification on real-world data.

Abstract

Voice-based applications dominate the era of automation because speech carries many cues that reveal information about the speaker as well as the spoken content. Modern Automatic Speech Recognition (ASR) is a boon to Human-Computer Interaction (HCI), enabling efficient communication between humans and devices through Artificial Intelligence technology. Speech is one of the easiest mediums of communication, and it has many features that distinguish one speaker from another. It is now possible to determine a speaker's identity from speech through speaker recognition. In this paper, we present a method that determines a speaker's geographical identity within a certain region from continuous Bengali speech. We consider the eight divisions of Bangladesh as the geographical regions. We applied Mel Frequency Cepstral Coefficient (MFCC) and Delta features to an Artificial Neural Network to classify the speaker's division. We performed preprocessing tasks such as noise reduction and segmentation of the raw audio into 8-10 second clips before feature extraction. We used our own dataset of more than 45 hours of audio from 633 individual male and female speakers. The highest accuracy we recorded was 85.44%.
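
The preprocessing and feature steps named in the abstract (fixed-length segmentation and Delta features over MFCCs) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `segment_audio` and `delta`, the 8-second default, the regression window `width=2`, and the edge padding are all assumptions made for illustration, and the MFCC matrix is assumed to be already computed by a separate feature extractor.

```python
import numpy as np


def segment_audio(samples, sample_rate, segment_seconds=8.0):
    """Split a 1-D signal into consecutive fixed-length segments.

    The 8-second default is an assumption; the abstract only states
    that segments are 8-10 seconds long. A short trailing remainder
    is dropped.
    """
    seg_len = int(segment_seconds * sample_rate)
    n_segments = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]


def delta(features, width=2):
    """Regression-based delta features (a common textbook definition).

    features: array of shape (n_frames, n_coeffs), e.g. an MFCC matrix.
    d_t = sum_{n=1..width} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n} n^2)
    Frames beyond the edges are filled by repeating the first/last frame.
    """
    n_frames = features.shape[0]
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[width + n: width + n + n_frames]
        behind = padded[width - n: width - n + n_frames]
        out += n * (ahead - behind)
    return out / denom


def stack_with_delta(mfcc, width=2):
    """Concatenate static coefficients with their deltas per frame."""
    return np.hstack([mfcc, delta(mfcc, width=width)])
```

With a hypothetical MFCC matrix of shape `(n_frames, 13)`, `stack_with_delta` returns an array of shape `(n_frames, 26)`, which could then be pooled or flattened as input to a classifier.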

Paper Structure (13 sections, 4 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Overall workflow.
  • Figure 2: A ratio of male-female speakers.
  • Figure 3: Amount of data at each label.
  • Figure 4: Average MFCCs for each label.
  • Figure 5: Architecture of Division Recognition Model.
  • ...and 2 more figures