Nord-Parl-TTS: Finnish and Swedish TTS Dataset from Parliament Speech
Zirui Li, Jens Edlund, Yicheng Gu, Nhan Phan, Lauri Juvela, Mikko Kurimo
TL;DR
Nord-Parl-TTS introduces an open, in-the-wild TTS dataset for Finnish and Swedish derived from parliament recordings. It comprises 900 hours of Finnish and 5090 hours of Swedish speech, is built with an adapted Emilia processing pipeline, and includes unified evaluation sets. The authors benchmark two non-autoregressive diffusion-based TTS models, Matcha-TTS and F5-TTS, showing that explicit alignment improves intelligibility, while implicit alignment can improve perceived human-likeness, depending on the language. The work addresses the Nordic resource gap by providing large-scale, openly available data and evaluation procedures, and it outlines future expansion to additional Nordic languages along with tooling improvements. Overall, Nord-Parl-TTS offers practical benchmarks and datasets to accelerate TTS development for Finnish and Swedish, and it serves as a foundation for broader Nordic-language TTS research.
Abstract
Text-to-speech (TTS) development is limited by the scarcity of high-quality, publicly available speech data for all but a few high-resource languages. We present Nord-Parl-TTS, an open TTS dataset for Finnish and Swedish based on speech found in the wild. Using recordings of Nordic parliamentary proceedings, we extract 900 hours of Finnish and 5090 hours of Swedish speech suitable for TTS training. The dataset is built using an adapted version of the Emilia data processing pipeline and includes unified evaluation sets to support model development and benchmarking. By offering open, large-scale data for Finnish and Swedish, Nord-Parl-TTS narrows the TTS resource gap between high-resource and lower-resourced languages.
