LSTM-CNN Network for Audio Signature Analysis in Noisy Environments

Praveen Damacharla, Hamid Rajabalipanah, Mohammad Hosein Fakheri

TL;DR

This work tackles the problem of counting concurrent speakers and estimating their gender in noisy environments where the exact number of active speakers is unknown and may reach up to $n + m \le 10$. It introduces a regression-based hybrid CNN-BLSTM-FC architecture that processes MFCC features to output $N_{men}$ and $N_{women}$. Using a dataset of 19,000 five-second mixtures with diverse genders, ages, accents, and background noises at $\mathrm{SNR} \approx 10$ dB, the model achieves a validation MSE as low as $0.017$, outperforming FC, CNN-FC, and LSTM-FC baselines. The work identifies optimal hyperparameters (e.g., kernel size $5 \times 5$, 256 filters, 7 CNN layers, and 2-second input windows) and demonstrates robustness across random train/test splits, highlighting practical potential for real-world multispeaker analytics in industry and public spaces.

Abstract

Automatically counting people and identifying their gender has many applications in workplaces, exhibitions, malls, sales, and industrial settings. Although current speech detection methods are expected to perform well, in most situations both the number and the genders of the concurrent speakers are unknown, and classification methods are unsuitable because of the large number of possible classes. In this study, we focus on a long short-term memory convolutional neural network (LSTM-CNN) that extracts time- and/or frequency-dependent features of the sound data to estimate the number and gender of simultaneously active speakers at each frame in noisy environments. Assuming a maximum of 10 speakers, we use 19,000 audio samples with diverse combinations of male and female speakers and background noise from cities, industrial sites, malls, exhibitions, workplaces, and nature for training. This proof of concept shows promising performance, with training/validation MSE values of about 0.019/0.017 in detecting speaker count and gender.
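To make the "many possible classes" remark concrete, the following minimal sketch counts the joint (men, women) labels a classifier would need when the total is capped at 10, and contrasts that with a two-output regression target scored by MSE. The function names and the example predictions are hypothetical, and averaging the squared error over both outputs is an assumption, not a detail confirmed by the abstract.

```python
MAX_SPEAKERS = 10  # upper bound on the total speaker count stated in the source


def joint_classes(max_speakers):
    """List every (n_men, n_women) pair whose sum does not exceed max_speakers."""
    return [
        (n_men, n_women)
        for n_men in range(max_speakers + 1)
        for n_women in range(max_speakers + 1 - n_men)
    ]


def mse(predictions, targets):
    """Mean squared error averaged over both outputs of every sample (assumed convention)."""
    squared_errors = [
        (p - t) ** 2
        for pred, tgt in zip(predictions, targets)
        for p, t in zip(pred, tgt)
    ]
    return sum(squared_errors) / len(squared_errors)


classes = joint_classes(MAX_SPEAKERS)
print(len(classes))  # number of labels a joint classifier would have to separate

# Hypothetical regression outputs (N_men, N_women) for two mixtures.
example_predictions = [(2.1, 0.9), (0.0, 3.2)]
example_targets = [(2, 1), (0, 3)]
print(mse(example_predictions, example_targets))
```

Under these assumptions a joint classifier needs 66 labels, whereas the regression formulation keeps two continuous outputs regardless of the cap.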

Paper Structure (7 sections, 5 figures, 4 tables)

This paper contains 7 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The illustration of the overall architecture for the proposed multiple speaker counting method.
  • Figure 2: The proposed hybrid CNN-LSTM-FC architecture for estimating the gender/number of the speakers in noisy environments.
  • Figure 3: The MSE values corresponding to (a) FC, (b) CNN-FC, (c) LSTM-FC, and (d) CNN-LSTM-FC networks.
  • Figure 4: Illustration of the effects of pre-windowing of the input audio signal on the MSE values of the proposed model: (a) window size q = 32000 samples, shift = 16000 samples; (b) window size q = 16000 samples, shift = 8000 samples; (c) window size q = 8000 samples, shift = 4000 samples.
  • Figure 5: The MSE values of the proposed CNN-LSTM-FC architecture upon variation of train-test data splitting.
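
The pre-windowing settings listed for Figure 4 can be illustrated with a short sketch that computes where each window starts for the three (window size, shift) pairs. It assumes a 16 kHz sampling rate, so that a five-second mixture holds 80000 samples and q = 32000 corresponds to the 2-second window mentioned in the TL;DR; the sampling rate and the helper name `window_starts` are assumptions, not details taken from the paper.

```python
ASSUMED_SAMPLE_RATE = 16000  # Hz; an assumption, not stated in the source
CLIP_SECONDS = 5             # mixture length given in the TL;DR
CLIP_SAMPLES = ASSUMED_SAMPLE_RATE * CLIP_SECONDS

# (window size q, shift) pairs from the Figure 4 caption, in samples.
FIGURE_4_SETTINGS = [(32000, 16000), (16000, 8000), (8000, 4000)]


def window_starts(num_samples, window_size, shift):
    """Return the start index of every full window that fits in the signal."""
    if window_size > num_samples:
        return []
    return list(range(0, num_samples - window_size + 1, shift))


window_counts = {}
for q, shift in FIGURE_4_SETTINGS:
    starts = window_starts(CLIP_SAMPLES, q, shift)
    window_counts[(q, shift)] = len(starts)
    print(f"q={q}, shift={shift}: {len(starts)} windows")
```

With a shift of half the window size, consecutive windows overlap by 50%, so halving q roughly doubles the number of windows extracted from each mixture.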