Quartered Spectral Envelope and 1D-CNN-based Classification of Normally Phonated and Whispered Speech

S. Johanan Joysingh, P. Vijayalakshmi, T. Nagarajan

TL;DR

This work introduces a quartered spectral envelope (QSE) feature that captures pitch-harmonic information concentrated in the first spectral quarter and uses a 1D-CNN to classify whispered versus normally phonated speech. The approach achieves near-perfect accuracy on wTIMIT (99.31%) and perfect accuracy on CHAINS (100%), and generally outperforms MFCC baselines while matching or surpassing the LFBE-LSTM state of the art with lower computational cost. Analyses show that kernel size, model depth, and sampling rate influence performance, with 16 kHz sampling and a compact architecture providing optimal results. The method demonstrates robustness to white noise and offers a practical front-end solution for inclusive speech systems that must handle whisper, making it suitable for real-time, low-overhead deployment.

Abstract

Whisper, as a form of speech, is not sufficiently addressed by mainstream speech applications. This is because systems built for normal speech do not work as expected for whispered speech. A first step toward building a speech application that is inclusive of whispered speech is the successful classification of whispered and normal speech. Such a front-end classification system is expected to have high accuracy and low computational overhead, which is the scope of this paper. One of the characteristics of whispered speech is the absence of the fundamental frequency (or pitch), and hence of the pitch harmonics as well. The presence of the pitch and pitch harmonics in normal speech, and their absence in whispered speech, is evident in the spectral envelope of the Fourier transform. We observe that this characteristic is predominant in the first quarter of the spectrum, and exploit it as a feature. We propose the use of one-dimensional convolutional neural networks (1D-CNN) to capture these features from the quartered spectral envelope (QSE). The system yields an accuracy of 99.31% when trained and tested on the wTIMIT dataset, and 100% on the CHAINS dataset. The proposed feature is compared with Mel frequency cepstral coefficients (MFCC), a staple in the speech domain. The proposed classification system is also compared with the state-of-the-art system, which uses log-filterbank energy (LFBE) features with a long short-term memory (LSTM) network. The proposed 1D-CNN-based system performs as well as or better than the state of the art across multiple experiments. It also converges sooner, with less computational overhead. Finally, the proposed system is evaluated in the presence of white noise at various signal-to-noise ratios and found to be robust.
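
The feature described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the Hann window, the hop size, the log compression, and the use of the plain log-magnitude spectrum as a stand-in for the spectral envelope are all assumptions made here for illustration. Only the 1024-point FFT and the idea of keeping the first quarter of the bins (128 of 512) come from the text and figure captions.

```python
import numpy as np

def quartered_spectral_envelope(signal, n_fft=1024, hop=512):
    """Per-frame log-magnitude spectrum, truncated to the first quarter of the bins.

    Assumption: the log-magnitude spectrum stands in for the spectral envelope;
    the paper may compute the envelope differently.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < n_fft:
        # Zero-pad very short inputs so that at least one frame exists.
        signal = np.pad(signal, (0, n_fft - signal.size))
    n_frames = 1 + (signal.size - n_fft) // hop
    window = np.hanning(n_fft)  # window choice is an assumption
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # rfft of a 1024-sample frame returns 513 bins (DC up to Nyquist).
    magnitude = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    n_keep = (n_fft // 2) // 4  # first quarter of 512 bins -> 128
    # Zero-based bins 0..127 are kept; the exact indexing is an assumption.
    return np.log(magnitude[:, :n_keep] + 1e-10)

# Hypothetical test signal: five harmonics of a 120 Hz "pitch" at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
voiced_like = sum(np.sin(2 * np.pi * 120 * k * t) for k in range(1, 6))
qse = quartered_spectral_envelope(voiced_like)
print(qse.shape)
```

Each row of the returned array is one frame's 128-value feature vector, the kind of one-dimensional input over which a 1D-CNN could convolve along the frequency axis.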

Paper Structure (21 sections, 1 equation, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Spectrogram computed using a 1024-point FFT, showing the differences between normal (above) and whispered (below) speech, sampled at 44.1 kHz.
  • Figure 2: Quartered spectrogram computed using a 1024-point FFT, showing bins 1 to 128, for normal (above) and whispered (below) speech, sampled at 44.1 kHz.
  • Figure 3: The architecture of the 1D-CNN proposed in the current work. The figure shows the architecture that offered the best results.
  • Figure 4: A section of the spectrogram computed with a 1024-point FFT, spanning bins 1 to 128, corresponding to normal speech sampled at 44.1 kHz (above) and 16 kHz (below).
  • Figure 5: MFCC features extracted from normal and whispered speech, sampled at 16 kHz.
  • ...and 1 more figure
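
To make the captions of Figures 2 and 4 more concrete, the short calculation below converts the 128-bin cutoff of a 1024-point FFT into hertz for the two sampling rates mentioned. The helper name `bin_to_hz` is hypothetical, and the bin index is treated as zero-based, which is an assumption about the captions' numbering.

```python
def bin_to_hz(bin_index, sample_rate_hz, n_fft=1024):
    """Centre frequency of an FFT bin: bin_index * sample_rate / n_fft."""
    return bin_index * sample_rate_hz / n_fft

for sample_rate_hz in (44100, 16000):
    spacing = bin_to_hz(1, sample_rate_hz)    # width of one bin in Hz
    cutoff = bin_to_hz(128, sample_rate_hz)   # upper edge of the first quarter
    print(f"{sample_rate_hz} Hz: bin spacing {spacing} Hz, bin 128 at {cutoff} Hz")
# 44100 Hz: bin spacing 43.06640625 Hz, bin 128 at 5512.5 Hz
# 16000 Hz: bin spacing 15.625 Hz, bin 128 at 2000.0 Hz
```

With the same 128 bins, the 16 kHz signal covers roughly 0 to 2 kHz at a finer spacing, while the 44.1 kHz signal covers roughly 0 to 5.5 kHz more coarsely. This is one plausible reading of why the sampling rate affects the feature; it is not a claim taken from the paper.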