Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis

Zhenqi Jia, Rui Liu

TL;DR

This work addresses conversational speech synthesis: generating speech whose prosody suits the target utterance by leveraging the multimodal dialogue history (MDH). It proposes I$^{3}$-CSS, a model that explicitly models intra-modal and inter-modal interactions between the MDH and the target utterance across the text and speech modalities, using four modal combinations and contrastive learning to align semantics and prosody. On the DailyTalk dataset, I$^{3}$-CSS outperforms strong baselines on both subjective (N-DMOS, P-DMOS) and objective (MAE-P, MAE-E, MAE-D) metrics, with statistically significant gains, which supports the value of explicit interaction modeling for prosody expressiveness. The system combines text and speech encoders, four interaction modules, a phoneme-based text encoder, and a HiFi-GAN vocoder; code and samples are provided for replication.

Abstract

Conversational Speech Synthesis (CSS) aims to use the multimodal dialogue history (MDH) effectively to generate speech with appropriate conversational prosody for the target utterance. The key challenge of CSS is to model the interaction between the MDH and the target utterance. The text and speech modalities in the MDH each exert their own influence, and they complement each other to produce a comprehensive impact on the target utterance. Previous works did not explicitly model such intra-modal and inter-modal interactions. To address this issue, we propose a new CSS system based on an intra-modal and inter-modal context interaction scheme, termed I$^{3}$-CSS. Specifically, in the training phase, we combine the MDH with the text and speech modalities of the target utterance to obtain four modal combinations: Historical Text-Next Text, Historical Speech-Next Speech, Historical Text-Next Speech, and Historical Speech-Next Text. We then design two intra-modal and two inter-modal interaction modules based on contrastive learning to learn the intra-modal and inter-modal context interactions. In the inference phase, we take the MDH and apply the trained interaction modules to infer the speech prosody for the text content of the target utterance. Subjective and objective experiments on the DailyTalk dataset show that I$^{3}$-CSS outperforms advanced baselines in terms of prosody expressiveness. Code and speech samples are available at https://github.com/AI-S2-Lab/I3CSS.
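
The four modal combinations and the contrastive objective can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not state the exact loss, so an InfoNCE-style objective over cosine similarities is assumed here, and the names `COMBINATIONS`, `kind`, `info_nce`, the temperature value, and the toy embeddings are all hypothetical.

```python
import math
from itertools import product

# The four (history modality, next-utterance modality) pairs named in the abstract.
MODALITIES = ("text", "speech")
COMBINATIONS = list(product(MODALITIES, MODALITIES))


def kind(history_modality, next_modality):
    """Same modality on both sides is intra-modal, otherwise inter-modal."""
    return "intra" if history_modality == next_modality else "inter"


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(history, nxt, temperature=0.1):
    """Assumed InfoNCE loss: history[i] should match nxt[i], not nxt[j] for j != i."""
    total = 0.0
    for i, h in enumerate(history):
        logits = [cosine(h, n) / temperature for n in nxt]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(history)


if __name__ == "__main__":
    # Toy 2-D embeddings for a batch of two dialogues (made-up numbers).
    history_text = [[1.0, 0.0], [0.0, 1.0]]
    next_speech = [[0.9, 0.1], [0.1, 0.9]]
    for h, n in COMBINATIONS:
        print(f"{kind(h, n)}-modal: historical {h} -> next {n}")
    print("aligned loss:", info_nce(history_text, next_speech))
    print("shuffled loss:", info_nce(history_text, next_speech[::-1]))
```

Under these assumptions, the loss is small when each history embedding is closest to its own next-utterance embedding and grows when the pairing is shuffled, which is the alignment pressure a contrastive interaction module would apply to each of the four combinations.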

Paper Structure

This paper contains 15 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Overview of I$^{3}$-CSS, which consists of Intra-modal Interaction Modules, Inter-modal Interaction Modules, a Text Encoder, and a Speech Synthesizer.