Covertly improving intelligibility with data-driven adaptations of speech timing

Paige Tuttösí, Angelica Lim, H. Henny Yeung, Yue Wang, Jean-Julien Aucouturier

Abstract

Human talkers often address listeners with language-comprehension challenges, such as hard-of-hearing or non-native adults, by globally slowing down their speech. However, it remains unclear whether this strategy actually makes speech more intelligible. Here, we take advantage of recent advances in machine-generated speech that allow more precise control of speech rate to systematically examine how targeted speech-rate adjustments may improve comprehension. We first use reverse-correlation experiments to show that the temporal influence of speech rate prior to a target vowel contrast (e.g., the tense-lax distinction) in fact manifests in a scissor-like pattern, with opposite effects in early versus late context windows; this pattern is remarkably stable both within individuals and across native L1-English listeners and L2-English listeners with French, Mandarin, and Japanese L1s. Second, we show that this speech-rate structure not only facilitates L2 listeners' comprehension of the target vowel contrast, but that native listeners also rely on this pattern in challenging acoustic conditions. Finally, we build a data-driven text-to-speech algorithm that replicates this temporal structure on novel speech sequences. Across a variety of sentences and vowel contrasts, listeners remained unaware that such targeted slowing improved word comprehension. Strikingly, participants instead judged the common strategy of global slowing as clearer, even though it actually increased comprehension errors. Together, these results show that targeted adjustments to speech rate significantly aid intelligibility under challenging conditions, while often going unnoticed. More generally, this paper provides a data-driven methodology to improve the accessibility of machine-generated speech, which can be extended to other aspects of speech comprehension and to a wide variety of listeners and environments.
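
To make the reverse-correlation step more concrete, the following is a minimal sketch of a first-order kernel estimate, not the authors' actual pipeline. It assumes a two-interval task in which a simulated listener hears two random speech-rate contours and picks the one that sounds more like the tense word; the kernel is then estimated as the mean difference between chosen and rejected contours. The names `TRUE_KERNEL`, `simulate_trial`, and `estimate_kernel`, the number of segments, and the noise level are all hypothetical choices made for illustration.

```python
import random

# Hypothetical "scissor" kernel over six pre-target segments (early -> late).
# Positive contour values mean slower speech. Assumed sign convention:
# early slowing pushes toward the lax word, late slowing toward the tense word.
TRUE_KERNEL = [-1.0, -1.0, -0.5, 0.0, 1.0, 1.0]


def simulate_trial(true_kernel, rng, noise_sd=1.0):
    """Return (chosen, rejected) rate contours for one two-interval trial."""
    n = len(true_kernel)
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]

    def evidence(contour):
        # Internal evidence for the tense response, plus decision noise.
        weighted = sum(w * x for w, x in zip(true_kernel, contour))
        return weighted + rng.gauss(0.0, noise_sd)

    return (a, b) if evidence(a) > evidence(b) else (b, a)


def estimate_kernel(trials):
    """First-order kernel: mean chosen contour minus mean rejected contour."""
    n_segments = len(trials[0][0])
    n_trials = len(trials)
    return [
        sum(chosen[i] - rejected[i] for chosen, rejected in trials) / n_trials
        for i in range(n_segments)
    ]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((u - mx) * (v - my) for u, v in zip(x, y))
    var_x = sum((u - mx) ** 2 for u in x)
    var_y = sum((v - my) ** 2 for v in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
trials = [simulate_trial(TRUE_KERNEL, rng) for _ in range(4000)]
kernel = estimate_kernel(trials)
similarity = pearson(kernel, TRUE_KERNEL)
```

In this simulation the estimated `kernel` recovers the shape of the hidden weights up to a scale factor, which is why it is compared to `TRUE_KERNEL` by correlation rather than by absolute value.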

Paper Structure

This paper contains 4 sections, 5 figures.

Figures (5)

  • Figure 1: A data-driven algorithm to manipulate speech rate in clear speech. In a series of three studies, we uncover which parametric temporal contours of speech rate are causal for improving word recognition, in a way that can then be 'synthesized' to generate novel, machine-generated speech. (Top-left) Using a speech signal-processing technique, we systematically varied speech rate in phrases surrounding difficult word contrasts (e.g. pill vs peel), and used reverse correlation to extract the temporal contour of rate (or 'kernel') that biases word recognition in one or the other direction (Study 1). (Top-right) We then resynthesized short phrases with graded intensity of speech rate manipulation ('kernel multiplier'), to quantify the kernel effect on word comprehension and investigate the effect of background noise (Study 2). (Bottom) Finally, we combined the data-driven kernel shape and multiplier with state-of-the-art text-to-speech (TTS) synthesis to provide an algorithmic way to make synthesized speech clearer, and tested its effect on both human and machine listeners (Study 3).
  • Figure 2: The temporal contour of rate information intake in leading phrasal context is remarkably stable across individuals and languages (Study 1b). We systematically varied speech rate in sentences preceding words distinguished by pairs of vowels that are difficult for L2 speakers: French pull (sweater: /y/) vs poule (chicken: /u/); English pill (/I/) vs peel (/i/). Reverse-correlation kernels showed a clear "scissor-shape" pattern, in which slower speech 800-300ms pre-target biased the perception of subsequent words in the direction of the phonetically faster option (French: pull, left; English: pill, middle & right), and slower speech starting 100-200ms pre-target biased the immediately following sound in the direction of the phonetically slower option (French: poule; English: peel). This pattern was conserved almost identically across four languages in both L1 (Top-Left: French-L1 on French; Top-Middle: English-L1 on English) and L2 participants (Bottom-Left: English-L1 on French; Bottom-Middle: French-L1 on English; Right: Japanese- and Mandarin Chinese-L1 on English).
  • Figure 3: Speech-rate scissor structures affect L2 English vowel comprehension across three L1s (Study 2a). Recognition accuracy for English sentences (I heard them say pill/peel/full) where the scissor-shape rate manipulation was applied at 11 levels of positive and negative (i.e., reversed) intensity, corresponding to distal 'context' speed multipliers ranging left to right from 1.5x (faster) to 0.67x (slower) and, simultaneously, proximal 'word' multipliers from 0.5x (slower) to 2.0x (faster). In all three English-L2 groups (Red: French; Pink: Mandarin; Light Blue: Japanese), accuracy was significantly swayed by applying the speech-rate kernel in one or the other direction. In contrast, English-L1 participants (Blue) remained insensitive to the kernel manipulations for all three words. Accuracy is displayed normalized with respect to 1x speed. Solid lines: fitted logistic regression. Shaded areas: 95% confidence interval on average accuracy.
  • Figure 4: Native listeners use speech rate as a fall-back strategy in challenging acoustic conditions (Study 2b). English-L1 recognition accuracy for manipulated English sentences (I heard them say pill/peel/fool). English-L1 participants showed L2-like behaviour only in the presence of background noise (orange; Left: pill; Middle: peel; in both panels, the blue curve is the same as in Fig. 3) or when the sound was synthesized with ambiguous formants (Right: full). Rate manipulation levels as in Fig. 3. Accuracy is displayed normalized with respect to 1x speed. Solid lines: fitted logistic regression. Shaded areas: 95% confidence interval on average accuracy.
  • Figure 5: Speech rate manipulations improve word comprehension of machine speech for non-native listeners, although they remained unaware of the facilitating effect. Left: Objective intelligibility (word error rate, WER) for 4 alternative strategies of speech-rate adjustments, measured on a sample of N=56 French-L1 participants listening to a set of 32 test sentences. Our proposed data-driven strategy (proximal slowing-down for tense vowels, no change for lax vowels) significantly reduced WER compared to baseline and other global slow-down strategies, both for sentences containing a single target word (Top) and double target words (Bottom). Middle: Subjective intelligibility (MOS) measured in the same task. Listeners rated models that stretched lax targets as significantly more intelligible, in both single-target (Top) and double-target sentences (Bottom), although these strategies actually increased WER. Right: Automatic speech recognition (ASR) word error rate on the same task. For machines, unlike humans, the 'stretch-everywhere' strategy significantly improved WER over baseline.
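
Word error rate, the objective measure reported in Figure 5, is conventionally defined as the word-level edit distance between a reference sentence and a transcription, divided by the number of reference words. The sketch below implements that standard definition. It is an illustration only: the function name `word_error_rate` and the lowercase, whitespace-based tokenization are assumptions, and the scoring procedure actually used in the paper is not described in this section.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
example_wer = word_error_rate("I heard them say peel", "I heard them say pill")
```

Here a listener who reports "pill" for "peel" in a five-word carrier phrase contributes a WER of 0.2 for that sentence, so confusions on a single target vowel translate directly into the per-sentence error rate.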