Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech
Shivam Mehta, Harm Lameris, Rajiv Punmiya, Jonas Beskow, Éva Székely, Gustav Eje Henter
TL;DR
The paper investigates whether probabilistic duration modelling improves non-autoregressive TTS, with a focus on spontaneous speech. It replaces the deterministic duration predictor with an OT-CFM-based duration model in three architectures (FastSpeech 2 (FS2), Matcha-TTS, and VITS) and evaluates on four corpora, two of read speech and two of spontaneous speech. The results show that stochastic duration modelling does not help the regression-based FS2, but gives equal or better naturalness for the probabilistic TTS models, especially on spontaneous speech, and that the flow-matching duration model adds only negligible synthesis overhead. The study also argues that LJ Speech is a limited benchmark for duration and prosody modelling, and advocates spontaneous-speech benchmarks to drive future TTS research.
Abstract
Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, prior studies generally only consider speech read aloud, and ignore spontaneous speech, despite the latter being both a more common and a more variable mode of speaking. We compare the effect of conventional deterministic duration modelling to durations sampled from a powerful probabilistic model based on conditional flow matching (OT-CFM), in three different NAR TTS approaches: regression-based, deep generative, and end-to-end. Across four different corpora, stochastic duration modelling improves probabilistic NAR TTS approaches, especially for spontaneous speech. Please see https://shivammehta25.github.io/prob_dur/ for audio and resources.
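The contrast between deterministic and sampled durations can be illustrated with a toy sketch. This is not the paper's implementation. It assumes durations are modelled in the log domain, replaces the trained OT-CFM network with a closed-form vector field for a Gaussian target (independent noise and data, sigma_min = 0), and integrates it with a fixed-step Euler solver. The per-phone means in `mu`, the spread `s`, and all function names are hypothetical.

```python
import numpy as np


def toy_vector_field(x, t, mu, s):
    """Closed-form stand-in for a learned OT-CFM vector field.

    Assumes a Gaussian target N(mu, s^2) and the straight-line path
    x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, 1). A real system would
    evaluate a trained neural network here instead.
    """
    var_t = (1.0 - t) ** 2 + (t * s) ** 2
    return mu + (t * s**2 - (1.0 - t)) / var_t * (x - t * mu)


def sample_log_durations(mu, s, rng, n_steps=10):
    """Draw one set of log-durations by Euler-integrating the ODE from t=0 to t=1."""
    x = rng.standard_normal(mu.shape)  # start from Gaussian noise
    h = 1.0 / n_steps
    for k in range(n_steps):
        x = x + h * toy_vector_field(x, k * h, mu, s)
    return x


def to_frames(log_dur):
    """Convert log-durations to integer frame counts, at least one frame per phone."""
    return np.maximum(1, np.rint(np.exp(log_dur))).astype(int)


# Hypothetical per-phone mean durations (in frames) for a four-phone utterance.
mu = np.log(np.array([4.0, 9.0, 6.0, 12.0]))
s = 0.3  # hypothetical spread in the log domain

# Deterministic (regression-style) prediction: identical on every call.
deterministic = to_frames(mu)

# Probabilistic prediction: a new noise draw gives new timings.
sample_a = to_frames(sample_log_durations(mu, s, np.random.default_rng(0)))
sample_b = to_frames(sample_log_durations(mu, s, np.random.default_rng(1)))

print("deterministic:", deterministic)
print("sample A:     ", sample_a)
print("sample B:     ", sample_b)
```

The deterministic path always returns the same frame counts, while each call to `sample_log_durations` with a fresh random generator produces a different plausible timing for the same utterance. With only a few Euler steps the sampled spread is an approximation of `s`; more steps would track the target distribution more closely at higher cost.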
