An Attention Long Short-Term Memory based system for automatic classification of speech intelligibility

Miguel Fernández-Díaz, Ascensión Gallardo-Antolín

TL;DR

This paper tackles automatic, non-intrusive prediction of speech intelligibility in dysarthric speech using LSTM networks fed with log-mel spectrograms, augmented by a simple attention mechanism that emphasizes informative frames. It compares a conventional SVM baseline with hand-crafted features against three LSTM variants (Basic, Mean-Pooling, and Attention-Pooling), and systematically explores preprocessing choices such as voice activity detection (VAD). On the UA-Speech dataset, the Attention-Pooling LSTM achieves the highest accuracy, surpassing the SVM baselines and the other LSTM variants, which indicates the value of temporal modeling and frame-level attention for intelligibility classification. The approach could support clinical monitoring and therapy by providing a non-intrusive, objective, and repeatable intelligibility assessment method.

Abstract

Speech intelligibility can be degraded by multiple factors, such as noisy environments, technical difficulties or biological conditions. This work focuses on the development of an automatic non-intrusive system for predicting the speech intelligibility level in the latter case. The main contribution of our research on this topic is the use of Long Short-Term Memory (LSTM) networks with log-mel spectrograms as input features for this purpose. In addition, this LSTM-based system is further enhanced by the incorporation of a simple attention mechanism that is able to determine the frames most relevant to this task. The proposed models are evaluated on the UA-Speech database, which contains dysarthric speech with different degrees of severity. Results show that the attention LSTM architecture outperforms both a reference Support Vector Machine (SVM)-based system with hand-crafted features and an LSTM-based system with Mean-Pooling.
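
To make the difference between Mean-Pooling and attention pooling concrete, the following is a minimal sketch and not the authors' implementation. It assumes one common form of "simple attention": each LSTM output frame is scored by a dot product with a weight vector `w` (learned in a real system, fixed by hand here), the scores are normalized with a softmax over frames, and the pooled vector is the resulting weighted sum. The function names, the vector `w` and the toy values are hypothetical; the exact formulation used in the paper is not given in this summary.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of frame scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pooling(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average the LSTM outputs over time: every frame weighs 1/L."""
    n_frames = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n_frames for d in range(dim)]


def attention_pooling(
    frames: Sequence[Sequence[float]], w: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Weighted average of the LSTM outputs.

    Assumption: frame scores are dot products with a vector `w`;
    the weights are the softmax of those scores over the frames.
    """
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, frames)) for d in range(dim)]
    return pooled, alphas


# Toy LSTM output sequence: L = 3 frames, n_L = 2 units (made-up values).
lstm_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

uniform_pooled, uniform_alphas = attention_pooling(lstm_outputs, [0.0, 0.0])
focused_pooled, focused_alphas = attention_pooling(lstm_outputs, [2.0, 0.0])
```

With an all-zero `w` every frame receives the same weight, so attention pooling reduces to Mean-Pooling. A non-zero `w` shifts weight toward the frames it scores highly, which is how an attention layer can emphasize the frames that matter most for the classification.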

Paper Structure

This paper contains 21 sections, 3 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Block diagram of the intelligibility level classification systems developed in this work.
  • Figure 2: Average energy of the modulation spectrum of a speech recording with (a) high intelligibility and (b) low intelligibility. Both utterances correspond to the word "jowls".
  • Figure 3: Two different LSTM-based architectures for speech intelligibility classification. (a) Basic LSTM; (b) LSTM with Mean-Pooling. The dimension of each variable is given in brackets, where $T$, $n_B$, $L$, $n_{D1}$, $n_L$, $n_{D2}$ and $n_C$ stand for the number of frames of the input signal, the number of mel filters, the length of the LSTM input/output sequence, the number of neurons in the first dense layer, the number of LSTM units, the number of neurons in the second dense layer and the number of classes (intelligibility levels), respectively.
  • Figure 4: LSTM-based architecture with the attention mechanism. The dimension of each variable is given in brackets, where $T$, $n_B$, $L$, $n_{D1}$, $n_L$, $n_{D2}$ and $n_C$ stand for the number of frames of the input signal, the number of mel filters, the length of the LSTM input/output sequence, the number of neurons in the first dense layer, the number of LSTM units, the number of neurons in the second dense layer and the number of classes (intelligibility levels), respectively.
  • Figure 5: Histogram of the length of the audio files in the database.
  • ...and 1 more figure