Emotion-Coherent Speech Data Augmentation and Self-Supervised Contrastive Style Training for Enhancing Kids' Story Speech Synthesis
Raymond Chung
TL;DR
This work tackles expressive kids' story speech synthesis under limited data by proposing emotion-coherent data augmentation that concatenates emotionally aligned sentences to form long-form audio and a self-supervised contrastive learning approach to refine speaking-style embeddings. The method leverages a fine-tuned sentence emotion predictor and a TP-GST framework to predict and render appropriate speaking styles for multi-sentence inputs, enabling inference of multi-sentence speech in a single pass. Experimental results show improved inter-sentence pause naturalness and higher naturalness and style suitability in subjective evaluations compared with a baseline trained on consecutive sentences. The approach offers a practical pathway to producing more engaging audiobook-style TTS for children with limited data, and opens avenues for richer expressiveness via SSL-based style learning and longer-context training.
Abstract
Expressive speech synthesis requires vibrant prosody and well-timed pauses. We propose an effective strategy for augmenting a small dataset to train an expressive end-to-end Text-to-Speech model. Using a text emotion recognizer, we concatenate audio clips whose texts are emotionally congruent, creating augmented expressive speech data. By training on two-sentence audio, our model learns natural pauses between sentences. We further apply self-supervised contrastive training to improve the extraction of speaking-style embeddings from speech. During inference, our model produces multi-sentence speech in one step, guided by the speaking style predicted from the text. Evaluations demonstrate the effectiveness of our approach compared with a baseline model trained on consecutive two-sentence audio. Our synthesized speech has an inter-sentence pause distribution closer to that of the ground-truth speech. In subjective evaluations, our synthesized speech scored higher in naturalness and style suitability than the baseline.
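The augmentation step described above can be illustrated with a minimal sketch. It assumes that each utterance already carries an emotion label produced by some text emotion recognizer, and that two-sentence samples are formed by pairing utterances with the same label. The `Utterance` type, the `augment_emotion_coherent` function, the random pairing within each emotion group, and the plain sample-wise concatenation are all hypothetical choices made for illustration. The abstract does not specify how pairs are selected or how the boundary between the two clips is handled.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Utterance:
    text: str
    emotion: str        # label from a text emotion recognizer (assumed to be given)
    audio: List[float]  # mono waveform samples


def augment_emotion_coherent(utterances: List[Utterance], seed: int = 0) -> List[Utterance]:
    """Build two-sentence samples by pairing utterances that share an emotion label.

    Hypothetical sketch: pairs are drawn at random inside each emotion group,
    and the two waveforms are joined by plain concatenation.
    """
    rng = random.Random(seed)

    # Group utterances by their predicted emotion label.
    by_emotion: Dict[str, List[Utterance]] = defaultdict(list)
    for utt in utterances:
        by_emotion[utt.emotion].append(utt)

    augmented: List[Utterance] = []
    for emotion, group in by_emotion.items():
        shuffled = group[:]
        rng.shuffle(shuffled)
        # Pair neighbours in the shuffled order; an odd leftover is skipped.
        for first, second in zip(shuffled[0::2], shuffled[1::2]):
            augmented.append(
                Utterance(
                    text=f"{first.text} {second.text}",
                    emotion=emotion,
                    audio=first.audio + second.audio,
                )
            )
    return augmented
```

Because pairing happens only inside an emotion group, every augmented sample keeps a single coherent emotion label, which is the property the abstract attributes to the augmented data. A real pipeline would likely also need to decide how much silence to keep at the junction, which this sketch leaves untouched.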
