MOSS-VoiceGenerator: Create Realistic Voices with Natural Language Descriptions

Kexin Huang, Liwei Fan, Botian Jiang, Yaozhou Jiang, Qian Tu, Jie Zhu, Yuqian Zhang, Yiwei Zhao, Chenchen Yang, Zhaoye Fei, Shimin Li, Xiaogui Yang, Qinyuan Cheng, Xipeng Qiu

Abstract

Voice design from natural language aims to generate speaker timbres directly from free-form textual descriptions, allowing users to create voices tailored to specific roles, personalities, and emotions. Such controllable voice creation benefits a wide range of downstream applications, including storytelling, game dubbing, role-play agents, and conversational assistants, making it a significant task for modern Text-to-Speech models. However, existing models are largely trained on carefully recorded studio data, which produces speech that is clean and well-articulated yet lacks the lived-in qualities of real human voices. To address these limitations, we present MOSS-VoiceGenerator, an open-source instruction-driven voice generation model that creates new timbres directly from natural language prompts. Motivated by the hypothesis that exposure to real-world acoustic variation produces more perceptually natural voices, we train on large-scale expressive speech data sourced from cinematic content. Subjective preference studies demonstrate its superiority in overall performance, instruction following, and naturalness compared to other voice design models.

Paper Structure

This paper contains 15 sections, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Illustration of the MOSS-VoiceGenerator inference. The voice description and text are concatenated and fed into a causal language model with delay-pattern generation; the output audio tokens are decoded by MOSS-Audio-Tokenizer.
  • Figure 2: Data collection pipeline for MOSS-VoiceGenerator. Phase 1 annotates cinematic audio via speaker diarization, denoising and quality filtering, single-speaker filtering, and ASR transcription, followed by speech captioning and timbre instruction generation. Phase 2 augments the corpus by training a speech-text embedding model for retrieval from internal TTS data and a fine-tuned caption model for scalable annotation.
  • Figure 3: A snapshot of the training corpus profiled along three perceptual dimensions based on the caption results. The distributions reveal broad, naturalistic coverage of everyday speaking styles.
  • Figure 4: Pairwise preference results (Win / Tie / Lose) of MOSS-VoiceGenerator against three baselines across three evaluation dimensions. Each bar reports the percentage of comparisons won, tied, or lost. MOSS-VoiceGenerator consistently wins on all three dimensions against all baselines.
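Figure 1 mentions delay-pattern generation, a common arrangement in codec language models where the tokens of codebook k are shifted right by k steps so that a single causal model can predict all codebooks jointly. The paper does not give implementation details, so the sketch below is a minimal illustration of the general technique, with the `PAD` token id and function names chosen for illustration only.

```python
PAD = -1  # hypothetical padding token id used to fill the delayed slots

def apply_delay_pattern(codes):
    """Shift codebook k right by k steps.

    codes: list of K token lists, each of length T (one per codebook).
    Returns a K x (T + K - 1) grid; slots outside each codebook's
    shifted window are filled with PAD.
    """
    K = len(codes)
    T = len(codes[0])
    grid = [[PAD] * (T + K - 1) for _ in range(K)]
    for k in range(K):
        for t in range(T):
            grid[k][k + t] = codes[k][t]
    return grid

def undo_delay_pattern(grid, T):
    """Recover the original K x T codes by undoing each row's shift."""
    return [row[k:k + T] for k, row in enumerate(grid)]
```

For example, two codebooks `[[1, 2, 3], [4, 5, 6]]` become `[[1, 2, 3, PAD], [PAD, 4, 5, 6]]`, so at each autoregressive step the model emits one token per codebook while codebook k only conditions on coarser codebooks from earlier frames.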