Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations

Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee

Abstract

Nonverbal vocalizations (NVs), such as laughter and sighs, are central to the expression of affective cues in emotional speech synthesis. However, learning diverse and contextually aligned NVs remains challenging in open settings due to limited NV data and the lack of explicit supervision. Motivated by this challenge, we propose Affectron as a framework for affective and contextually aligned NV generation. Built on a small-scale open and decoupled corpus, Affectron introduces an NV-augmented training strategy that expands the distribution of NV types and insertion locations. We further incorporate NV structural masking into a speech backbone pre-trained on purely verbal speech to enable diverse and natural NV synthesis. Experimental results demonstrate that Affectron produces more expressive and diverse NVs than baseline systems while preserving the naturalness of the verbal speech stream.

Paper Structure

This paper contains 43 sections, 12 equations, 13 figures, and 9 tables.

Figures (13)

  • Figure 1: Analysis of emotional attribute change patterns across temporal gaps (up to a gap of 10) for the EARS dataset [richter24_interspeech] (verbal only) and the NonverbalTTS dataset [borisov2025nonverbaltts] (verbal-nonverbal combined speech).
  • Figure 2: Overview of the Affectron training framework. NV candidates are selected and routed to contextually appropriate locations to construct NV-augmented training samples, which are then used to fine-tune the VoiceCraft backbone for affect-aware NV synthesis.
  • Figure 3: AB preference test comparing the proposed NV augmentation with rule-guided randomized strategies following CapSpeech [wang2025capspeech].
  • Figure 4: Comparison of our proposed method in terms of NTN-MOS and NEC-MOS. Augmented GT applies our NV augmentation to the ground truth. Vertical lines indicate the 95% confidence intervals.
  • Figure 5: Comparison of the diversity of generated fine-grained filler variations across models.
  • ...and 8 more figures