Acoustic characterization of speech rhythm: going beyond metrics with recurrent neural networks

François Deloche; Laurent Bonnasse-Gahot; Judit Gervain

TL;DR

This paper aims to capture the acoustic bases of speech rhythm beyond what traditional rhythm metrics describe. The authors train a medium-sized LSTM on language identification using rhythm-focused inputs (amplitude envelopes and voicing). The learned representations align with speech rhythm typologies, and some network activations correlate with established rhythm metrics, which makes it possible to build interpretable, rhythm-based maps of languages. Raw accuracy is modest (40% top-1 on 10-second recordings across 21 languages), but the representations carry meaningful rhythm-related structure, which suggests that deep learning can complement classic metrics in rhythm research. The study also provides open resources and a data-driven framework for exploring rhythmic regularities across languages, with implications for psycholinguistics and language learning technologies.

Abstract

Languages have long been described according to their perceived rhythmic attributes. The associated typologies are of interest in psycholinguistics as they partly predict newborns' abilities to discriminate between languages and provide insights into how adult listeners process non-native languages. Despite the relative success of rhythm metrics in supporting the existence of linguistic rhythmic classes, quantitative studies have yet to capture the full complexity of temporal regularities associated with speech rhythm. We argue that deep learning offers a powerful pattern-recognition approach to advance the characterization of the acoustic bases of speech rhythm. To explore this hypothesis, we trained a medium-sized recurrent neural network on a language identification task over a large database of speech recordings in 21 languages. The network had access to the amplitude envelopes and a variable identifying the voiced segments, assuming that this signal would poorly convey phonetic information but preserve prosodic features. The network was able to identify the language of 10-second recordings in 40% of the cases, and the language was in the top-3 guesses in two-thirds of the cases. Visualization methods show that representations built from the network activations are consistent with speech rhythm typologies, although the resulting maps are more complex than two separate clusters of stress-timed and syllable-timed languages. We further analyzed the model by identifying correlations between network activations and known speech rhythm metrics. The findings illustrate the potential of deep learning tools to advance our understanding of speech rhythm through the identification and exploration of linguistically relevant acoustic feature spaces.
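To put the reported accuracies in context, the sketch below computes the chance levels for a 21-class task and shows one way to compute top-k accuracy. It assumes equiprobable language classes, under which chance is 1/21 (about 4.8%) for the top guess and 3/21 (about 14.3%) for the top-3 guesses. The function names and the toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

N_LANGUAGES = 21  # number of languages stated in the abstract


def chance_top_k(n_classes, k):
    """Chance level for top-k accuracy, assuming equiprobable classes."""
    return k / n_classes


def top_k_accuracy(probs, labels, k):
    """Fraction of samples whose true label is among the k most probable classes."""
    probs = np.asarray(probs, dtype=float)
    top_k = np.argsort(probs, axis=1)[:, -k:]  # indices of the k largest entries per row
    hits = [label in row for label, row in zip(labels, top_k)]
    return float(np.mean(hits))


chance_top1 = chance_top_k(N_LANGUAGES, 1)
chance_top3 = chance_top_k(N_LANGUAGES, 3)

# Toy example with 4 hypothetical classes (not data from the paper).
toy_probs = [
    [0.50, 0.30, 0.10, 0.10],
    [0.10, 0.20, 0.30, 0.40],
    [0.25, 0.35, 0.30, 0.10],
]
toy_labels = [0, 2, 3]
toy_top1 = top_k_accuracy(toy_probs, toy_labels, k=1)
toy_top3 = top_k_accuracy(toy_probs, toy_labels, k=3)
```

Under this assumption, the reported 40% top-1 accuracy is roughly 8 times the chance level, and the two-thirds top-3 accuracy is roughly 4.7 times the chance level.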

Paper Structure (11 sections, 4 equations, 7 figures, 1 table)

Introduction
Methods
Speech data
Model inputs
Model architecture
Visualization methods
Comparison with speech metrics
Results
Language discrimination
Correlations with rhythm metrics
Discussion

Figures (7)

Figure 1: Features for the language identification task illustrated on the sentence: 'a hurricane was announced this afternoon on the TV' (from the Ramus corpus ramus1999a). SPL: sound pressure level. SPL-H: sound pressure level after the signal is passed through a gentle high-pass filter. $F_0$: fundamental frequency. In the main version of the model presented in the paper, only the voicing information is kept for the third dimension: 1 for voiced segments ($F_0 \neq 0$), 0 for voiceless segments ($F_0 = 0$). The three features are sampled at 31.25 Hz. The spectrogram of the sentence is shown for guidance (background image).
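The feature construction described in this caption can be sketched as follows. This is a minimal illustration, assuming the SPL, SPL-H, and $F_0$ tracks are already available as arrays sampled at 31.25 Hz (one frame every 32 ms). The function name `build_inputs` and the toy values are hypothetical, and setting the first delta to zero is an assumption; the deltas themselves follow the definition given in the Figure 2 caption (differences between two time steps).

```python
import numpy as np

FRAME_RATE_HZ = 31.25
FRAME_PERIOD_S = 1.0 / FRAME_RATE_HZ  # 0.032 s per frame


def build_inputs(spl, spl_h, f0):
    """Stack SPL, SPL-H, and binary voicing, then append their deltas.

    Returns an array of shape (T, 6): three features plus three deltas.
    """
    spl = np.asarray(spl, dtype=float)
    spl_h = np.asarray(spl_h, dtype=float)
    f0 = np.asarray(f0, dtype=float)

    # Voicing: 1 where F0 is nonzero (voiced), 0 where F0 is zero (voiceless).
    voiced = (f0 != 0).astype(float)

    feats = np.stack([spl, spl_h, voiced], axis=1)  # (T, 3)

    # Delta = difference between consecutive frames.
    # Assumption: the first frame gets a delta of zero.
    deltas = np.diff(feats, axis=0, prepend=feats[:1])  # (T, 3)

    return np.concatenate([feats, deltas], axis=1)


# Toy tracks (hypothetical values, not data from the paper).
spl = [60.0, 62.0, 61.0, 55.0]
spl_h = [50.0, 53.0, 52.0, 40.0]
f0 = [0.0, 120.0, 125.0, 0.0]
x = build_inputs(spl, spl_h, f0)
```

In this toy example the voicing column is 0, 1, 1, 0, so the network sees when voicing starts and stops but not the pitch value itself.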
Figure 2: Block diagram of the recurrent neural network used for the language identification task. The input layer consists of the three features and the associated deltas (differences between two time steps) sampled at 31.25 Hz. The two hidden layers contain 150 LSTM units each; after the softmax function is applied, the model output can be interpreted as a vector of probabilities over the candidate languages.
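The architecture in this caption can be sketched as a NumPy forward pass with untrained random weights. It uses the sizes given in the text (6 inputs, two layers of 150 LSTM units, 21 output classes). Several details are assumptions made for illustration: a standard LSTM cell with one bias vector per gate, no dropout or other regularization, and a softmax readout at every frame. The helper names are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

N_FEATURES = 6    # three features plus their deltas
N_HIDDEN = 150    # LSTM units per hidden layer
N_LANGUAGES = 21  # output classes


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_lstm(n_in, n_hidden, rng):
    """Random weights for one LSTM layer; the four gates are stacked in W and b."""
    scale = 1.0 / np.sqrt(n_hidden)
    return {
        "W": rng.uniform(-scale, scale, (4 * n_hidden, n_in + n_hidden)),
        "b": np.zeros(4 * n_hidden),
    }


def lstm_layer(x_seq, params):
    """Run a standard LSTM over a (T, n_in) sequence; returns (T, n_hidden)."""
    n_hidden = params["b"].size // 4
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    outputs = []
    for x_t in x_seq:
        z = params["W"] @ np.concatenate([x_t, h]) + params["b"]
        i, f, g, o = np.split(z, 4)  # input, forget, candidate, output
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)


def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def forward(x_seq, layer1, layer2, w_out, b_out):
    """Per-frame language probabilities, shape (T, N_LANGUAGES)."""
    h1 = lstm_layer(x_seq, layer1)
    h2 = lstm_layer(h1, layer2)
    return softmax(h2 @ w_out.T + b_out)


rng = np.random.default_rng(0)
layer1 = init_lstm(N_FEATURES, N_HIDDEN, rng)
layer2 = init_lstm(N_HIDDEN, N_HIDDEN, rng)
w_out = rng.normal(0.0, 0.1, (N_LANGUAGES, N_HIDDEN))
b_out = np.zeros(N_LANGUAGES)

# A 10 s recording at 31.25 Hz is about 312 frames; random input stands in for real features.
x_seq = rng.normal(size=(312, N_FEATURES))
probs = forward(x_seq, layer1, layer2, w_out, b_out)

n_params = sum(a.size for a in (layer1["W"], layer1["b"], layer2["W"], layer2["b"], w_out, b_out))
```

Under these assumptions the model has 277,971 trainable parameters, which gives a sense of scale for the "medium-sized" description. Frameworks that use two bias vectors per LSTM layer would report a slightly higher count.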
Figure 3: (a) Model accuracy and (b) top-3 accuracy during training as a function of epoch for the language identification task. Test accuracy stopped improving after 25 epochs, which is where training was stopped (early stopping) for the version presented in the paper. (c) Mean accuracy after training at different time points within the recordings (0 s = start of recordings, 10 s = end of recordings). The horizontal dashed lines indicate the chance levels assuming equiprobability of the language classes. Light lines: raw numbers; dark lines: smoothed data; colored dashed lines: data between epochs 26 and 30 (for reference).
Figure 4: Confusion matrix on the test set for a subset of languages (languages from the Ramus corpus ramus1999a). The values are percentages (normalized by row). The entire confusion matrix is provided as Supplementary Fig. 6.
Figure 5: (a) Hierarchical clustering dendrogram based on histograms of the DNN probability vector output using the complete linkage method and the Bhattacharyya distance. (b) Metric multidimensional scaling (MDS) visualization for the languages in the right branch of the dendrogram, also based on the Bhattacharyya distance between activation histograms. MDS stress: 0.14.
...and 2 more figures
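The clustering behind Figure 5(a) combines two standard ingredients, the Bhattacharyya distance and complete linkage, which can be sketched as follows. The sketch takes per-language histograms as given; how the paper builds those histograms from the network output is not specified here, so the toy histograms and the labels `lang_a`, `lang_b`, `lang_c` are purely hypothetical. The naive clustering loop is written for readability, not efficiency.

```python
import numpy as np


def bhattacharyya_distance(p, q, eps=1e-12):
    """D_B(p, q) = -ln(sum_i sqrt(p_i * q_i)) for two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))          # Bhattacharyya coefficient, between 0 and 1
    return float(-np.log(max(bc, eps)))  # eps guards against log(0)


def distance_matrix(histograms):
    """Symmetric matrix of pairwise Bhattacharyya distances."""
    n = len(histograms)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = bhattacharyya_distance(histograms[i], histograms[j])
    return d


def complete_linkage(d, names):
    """Naive agglomerative clustering; returns merges as (left, right, distance).

    Complete linkage: the distance between two clusters is the largest
    pairwise distance between their members.
    """
    clusters = [(name, [i]) for i, name in enumerate(names)]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = max(d[i, j] for i in clusters[a][1] for j in clusters[b][1])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a][0], clusters[b][0], dist))
        merged = (f"({clusters[a][0]}+{clusters[b][0]})", clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Toy histograms for three hypothetical languages (not data from the paper).
names = ["lang_a", "lang_b", "lang_c"]
hists = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
d = distance_matrix(hists)
merges = complete_linkage(d, names)
```

In the toy example the two similar histograms merge first, and the third joins at the larger of its two pairwise distances, which is the defining behavior of complete linkage. The same distance matrix could also serve as input to an MDS routine for a map like the one in panel (b).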
