JoyVoice: Long-Context Conditioning for Anthropomorphic Multi-Speaker Conversational Synthesis

Fan Yu, Tao Wang, You Wu, Lin Zhu, Wei Deng, Weisheng Han, Wenchao Wang, Lin Hu, Xiangyu Liang, Xiaodong He, Yankun Huang, Yu Gu, Yuan Liu, Yuxuan Wang, Zhangyu Xiao, Ziteng Wang, Boya Dong, Feng Dang, Jinming Chen, Jingdong Li, Jun Wang, Yechen Jin, Yuan Zhang, Zhengyan Sheng, Xin Wang

TL;DR

JoyVoice tackles long-context, multi-speaker conversational TTS with a unified end-to-end Transformer-DiT architecture that conditions diffusion-based synthesis directly on autoregressive hidden states. It pairs this with a high-efficiency MM-Tokenizer and dynamic chunking to support streaming long-form dialogue, and uses curriculum learning and reinforcement learning for stability and quality. Key contributions include end-to-end joint optimization, a 12.5 Hz acoustic-token representation, and state-of-the-art results on Seed-TTS-Eval and multi-speaker voice cloning benchmarks. The work demonstrates robust, scalable synthesis with up to eight speakers and conversations of up to five minutes, with practical implications for expressive, long-form AI voices in multilingual settings.

Abstract

Large speech generation models are evolving from single-speaker, short-sentence synthesis to multi-speaker, long-conversation generation. Current long-form speech generation models are predominantly constrained to dyadic, turn-based interactions. To address this, we introduce JoyVoice, a novel anthropomorphic foundation model designed for flexible, boundary-free synthesis of up to eight speakers. Unlike conventional cascaded systems, JoyVoice employs a unified E2E-Transformer-DiT architecture that feeds autoregressive hidden representations directly into the diffusion model, enabling holistic end-to-end optimization. We further propose an MM-Tokenizer operating at a low token rate of 12.5 Hz, which integrates multitask semantic and MMSE losses to effectively model both semantic and acoustic information. Additionally, the model incorporates robust text front-end processing via large-scale data perturbation. Experiments show that JoyVoice achieves state-of-the-art results in multilingual generation (Chinese, English, Japanese, Korean) and zero-shot voice cloning. JoyVoice achieves top-tier results on both the Seed-TTS-Eval benchmark and multi-speaker long-form conversational voice cloning tasks, demonstrating superior audio quality and generalization. It achieves significant improvements in prosodic continuity for long-form speech, rhythm richness in multi-speaker conversations, and paralinguistic naturalness, alongside superior intelligibility. We encourage readers to listen to the demo at https://jea-speech.github.io/JoyVoice
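To give a feel for what a 12.5 Hz token rate means for long-form synthesis, the following minimal sketch computes how many discrete tokens the autoregressive model would need to predict for a given audio duration. It assumes exactly one token per frame at 12.5 frames per second; the function name `tokens_for_duration` is hypothetical and not part of JoyVoice, and the five-minute duration is taken from the TL;DR.

```python
import math

# Token rate stated in the abstract. One token per frame is an assumption
# made for this illustration.
TOKEN_RATE_HZ = 12.5


def tokens_for_duration(seconds: float, rate_hz: float = TOKEN_RATE_HZ) -> int:
    """Return the number of tokens needed to cover `seconds` of audio."""
    if seconds < 0 or rate_hz <= 0:
        raise ValueError("seconds must be >= 0 and rate_hz must be > 0")
    # Round up so a partially filled final frame still gets a token.
    return math.ceil(seconds * rate_hz)


if __name__ == "__main__":
    for label, secs in [("10 s utterance", 10), ("5 min conversation", 300)]:
        print(f"{label}: {tokens_for_duration(secs)} tokens")
```

Under these assumptions a five-minute conversation corresponds to 3,750 tokens, which is the sequence length the long-context conditioning has to handle.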

Paper Structure

This paper contains 23 sections, 15 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Performance comparison between JoyVoice and competitive speech generation models in terms of content consistency (CER/WER) on various benchmarks. An error rate of 100 means that the released model does not support the test task.
  • Figure 2: An overview of JoyVoice. It consists of three key components: 1) Causal Autoregressive Transformer: This module takes the system prompt plus text with speaker tags as input and predicts the discrete tokens generated by the MM-Tokenizer. 2) Dynamic Chunk Diffusion Transformer: Using the hidden representations from the Causal Autoregressive Transformer as input, this component predicts the mel-spectrogram output. 3) MM-Tokenizer and Vocoder: These modules convert audio into discrete token representations and reconstruct waveform audio from the mel-spectrogram, respectively.
  • Figure 3: Illustration of JoyVoice MM-Tokenizer framework.
  • Figure 4: Speaker Distribution in JoyVoice's Long-Form Audio Training Data.
  • Figure 5: Comparison of full-parameter SFT results on five speakers.