How Does a Deep Neural Network Look at Lexical Stress?

Itai Allouche, Itay Asael, Rotem Rousso, Vered Dassa, Ann Bradlow, Seung-Eun Kim, Matthew Goldrick, Joseph Keshet

TL;DR

This work probes how deep convolutional networks infer lexical stress in English disyllables by training CNNs on spectrogram representations of automatically gathered read and spontaneous speech. It pairs strong predictive performance (up to 92% accuracy on held-out test data) with a detailed interpretability analysis using Layerwise Relevance Propagation (LRP), Intersection over Union (IOU) between relevance heatmaps and (sub)syllabic regions, and feature-specific relevance mappings to acoustic cues such as $F_0$, $F_1$, $F_2$, and $F_3$. The best-performing model (e.g., VGG16) relies heavily on the stressed vowel, particularly its first and second formants ($F_1$ and $F_2$), while still using information from unstressed regions and other cues, suggesting a distributed, relational set of stress cues learned from natural data. The study provides a publicly available, automatically constructed lexical-stress dataset and an interpretability framework that connects deep learning analyses with traditional phonetic knowledge, with potential applications in phonetics research and speech technology.

Abstract

Despite their success in speech processing, neural networks often operate as black boxes, prompting the question: what informs their decisions, and how can we interpret them? This work examines this issue in the context of lexical stress. A dataset of English disyllabic words was automatically constructed from read and spontaneous speech. Several Convolutional Neural Network (CNN) architectures were trained to predict stress position from a spectrographic representation of disyllabic words lacking minimal stress pairs (e.g., initial stress WAllet, final stress exTEND), achieving up to 92% accuracy on held-out test data. Layerwise Relevance Propagation (LRP), a technique for CNN interpretability analysis, revealed that predictions for held-out minimal pairs (PROtest vs. proTEST) were most strongly influenced by information in stressed versus unstressed syllables, particularly the spectral properties of stressed vowels. However, the classifiers also attended to information throughout the word. A feature-specific relevance analysis is proposed, and its results suggest that our best-performing classifier is strongly influenced by the stressed vowel's first and second formants, with some evidence that its pitch and third formant also contribute. These results reveal deep learning's ability to acquire distributed cues to stress from naturally occurring data, extending traditional phonetic work based around highly controlled stimuli.
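To make the idea of relevance propagation concrete, the following is a minimal sketch of the standard LRP-epsilon rule applied to a single dense layer. It is not the paper's implementation: the function name `lrp_epsilon_dense`, the toy weights, and the choice to start from the predicted class's logit are all assumptions made for illustration, and the paper's CNNs would apply such rules layer by layer down to the spectrogram input.

```python
import numpy as np

def lrp_epsilon_dense(a, W, b, R_out, eps=1e-6):
    """Redistribute output relevance R_out onto the inputs of one dense layer.

    a: (n_in,) input activations, W: (n_in, n_out) weights, b: (n_out,) biases.
    Hypothetical helper for illustration only.
    """
    z = a @ W + b                                    # forward pre-activations
    z_stab = z + eps * np.where(z >= 0, 1.0, -1.0)   # epsilon stabiliser, keeps the sign of z
    s = R_out / z_stab                               # relevance per unit of pre-activation
    c = W @ s                                        # send it back through the weights
    return a * c                                     # weight by each input's activation

# Toy layer with three inputs and two output classes (assumed values).
a = np.array([1.0, 2.0, 0.5])
W = np.array([[1.0, -1.0],
              [0.5,  2.0],
              [-2.0, 1.0]])
b = np.zeros(2)

logits = a @ W + b
# Start from the logit of the predicted class; all other outputs get zero relevance.
R_out = np.where(np.arange(2) == logits.argmax(), logits, 0.0)
R_in = lrp_epsilon_dense(a, W, b, R_out)
print(R_in, R_in.sum(), R_out.sum())
```

With zero biases and a small `eps`, the input relevances sum to (almost exactly) the output relevance, which is the conservation property that lets a heatmap be read as a decomposition of the model's decision.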

Paper Structure

This paper contains 19 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: End-to-end workflow for dataset construction, model training, and interpretability analysis. The blue-colored section illustrates the dataset creation and training process. Audio is collected from various datasets and extracted based on word- and phoneme-level timestamps. Words are tagged for stress using either part-of-speech tagging (minimal pairs) or dictionary entries (words without a minimal pair). Once the audio is extracted, the training data (words without a minimal pair) is expanded through data augmentation to make CNN-based model training more robust. The red-colored section illustrates the processes of model interpretability and acoustic feature relevance measurement, applied to the test set of minimal pairs. Audio is converted into a spectrogram, which is processed by the trained model; LRP methods are then applied to generate heatmaps, revealing the regions of the spectrogram that contributed most to the model’s predictions. The contribution of different (sub)syllabic regions to the LRP heatmap is quantified with the IOU metric. For the stressed vowel, acoustic features are extracted and used to generate feature-specific heatmaps; these are compared to the full heatmap to determine which acoustic features are most dominant in the model’s decision-making process.
  • Figure 2: Spectrograms and corresponding attributions for the words "PREsent" and "preSENT" using the VGG16 model. Orange vertical lines represent phoneme boundaries, with each label centered within its phoneme segment; 1 and 0/2 denote stressed and unstressed vowels, respectively. The blue vertical line represents the end of the initial syllable. Panel (a) shows the spectrogram of "PREsent" (IS), while panel (e) shows the spectrogram of "preSENT" (FS). Panels (b)-(d) and (f)-(h) display attributions using $\mathbf{LRP_{\epsilon}}$, $\mathbf{LRP_{\alpha1}}$, and $\mathbf{LRP_{CMP}}$, respectively, shown over the spectrograms. These LRP variants effectively capture initial versus final stress contrasts, highlighting features at different scales. The $\mathbf{LRP_{\alpha1}}$ (panel (c) vs. (g)) and $\mathbf{LRP_{CMP}}$ (panel (d) vs. (h)) methods show clearly different patterns depending on the location of stress. Note that the phoneme labels differ slightly across the two rows; these reflect natural differences in pronunciation between initial and final stress members of this word pair (i.e., stress-related changes to vowel quality).
  • Figure 3: Further illustrative examples of spectrograms and corresponding attributions for the words "subject," "address," and "refuse" with different primary stress locations, using the VGG16 model. Vertical lines and annotations follow Figure 2. Panels (a)-(d) show spectrograms and corresponding $\mathbf{LRP_{CMP}}$ heatmaps for the word "subject" with initial and final stress, respectively. Similarly, panels (e)-(l) display spectrograms and heatmaps of the words "address" and "refuse" with different stress types. Note that the examples were drawn from datasets in which word segments were zero-padded to reach the fixed 0.5 second window length, as described in Section 2.2.
  • Figure 4: Heatmaps of the first two formants ($F_1$ and $F_2$) and the original heatmap generated by $\mathbf{LRP_{CMP}}$ using VGG16 for the word "PREsent". Panels (a) and (b) display the heatmaps generated for the two formants, considering bandwidth, spectrogram intensity, and voice intensity within the relevant time points. Panel (c) shows the attribution heatmap generated by $\mathbf{LRP_{CMP}}$ using VGG16. Orange vertical lines represent phoneme boundaries, and the blue vertical line represents the end of the initial syllable, as in Figure 2.
  • Figure 5: Classification accuracy for each minimal pair word obtained with the VGG16 model. Each bar represents the proportion of correctly classified tokens for a given minimal pair word. Bars corresponding to words with vowel reduction are shown in dark blue, and those without vowel reduction are shown in light blue with hatching. The number of tokens for each word is indicated in parentheses. Words are ordered by accuracy.
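The IOU step described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's procedure: the heatmap is assumed to be a 2-D array with frequency bins as rows and time frames as columns, the "relevant" pixels are assumed to be those at or above a fraction of the peak positive relevance (the source does not say how heatmaps are binarised), and `region_iou`, `frac`, and the frame indices are hypothetical names and values.

```python
import numpy as np

def region_iou(heatmap, t_start, t_end, frac=0.5):
    """IOU between high-relevance pixels and a time region [t_start, t_end).

    heatmap: (n_freq, n_frames) relevance scores; binarisation rule is an assumption.
    """
    positive = np.clip(heatmap, 0.0, None)           # keep only positive relevance
    peak = positive.max()
    if peak > 0:
        hot = positive >= frac * peak                 # pixels near the peak relevance
    else:
        hot = np.zeros(heatmap.shape, dtype=bool)     # no positive relevance at all
    region = np.zeros(heatmap.shape, dtype=bool)
    region[:, t_start:t_end] = True                   # all frequencies within the time span
    union = np.logical_or(hot, region).sum()
    inter = np.logical_and(hot, region).sum()
    return float(inter / union) if union else 0.0

# Toy heatmap: relevance concentrated in frames 2..4 (say, a stressed vowel).
toy = np.zeros((4, 10))
toy[:, 2:5] = 1.0

print(region_iou(toy, 2, 5))    # region matches the relevant frames
print(region_iou(toy, 0, 5))    # region is wider than the relevant frames
print(region_iou(toy, 5, 10))   # region misses the relevant frames
```

Computing this score for each syllable or sub-syllabic segment, using the phoneme boundaries as `t_start` and `t_end`, gives one number per region that can be compared across stressed and unstressed parts of the word.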