Table of Contents
Fetching ...

WhispSynth: Scaling Multilingual Whisper Corpus through Real Data Curation and A Novel Pitch-free Generative Framework

Tianyi Tan, Jiaxin Ye, Yuanming Zhang, Xiaohuai Le, Xianjun Xia, Chuanzeng Huang, Jing Lu

Abstract

Whisper generation is constrained by the difficulty of data collection. Because whispered speech has low acoustic amplitude, high-fidelity recording is challenging. In this paper, we introduce WhispSynth, a large-scale multilingual corpus constructed via a novel high-fidelity generative framework. Specifically, we propose a pipeline integrating Differentiable Digital Signal Processing (DDSP)-based pitch-free method with Text-to-Speech (TTS) models. This framework refines a comprehensive collection of resources, including our newly constructed WhispNJU dataset, into 118 hours of high-fidelity whispered speech from 479 speakers. Unlike standard synthetic or noisy real data, our data engine faithfully preserves source vocal timbre and linguistic content while ensuring acoustic consistency, providing a robust foundation for text-to-whisper research. Experimental results demonstrate that WhispSynth exhibits significantly higher quality than existing corpora. Moreover, our CosyWhisper, tuned with WhispSynth, achieves speech naturalness on par with ground-truth samples. The official implementation and related resources are available at https://github.com/tan90xx/cosywhisper.

WhispSynth: Scaling Multilingual Whisper Corpus through Real Data Curation and A Novel Pitch-free Generative Framework

Abstract

Whisper generation is constrained by the difficulty of data collection. Because whispered speech has low acoustic amplitude, high-fidelity recording is challenging. In this paper, we introduce WhispSynth, a large-scale multilingual corpus constructed via a novel high-fidelity generative framework. Specifically, we propose a pipeline integrating Differentiable Digital Signal Processing (DDSP)-based pitch-free method with Text-to-Speech (TTS) models. This framework refines a comprehensive collection of resources, including our newly constructed WhispNJU dataset, into 118 hours of high-fidelity whispered speech from 479 speakers. Unlike standard synthetic or noisy real data, our data engine faithfully preserves source vocal timbre and linguistic content while ensuring acoustic consistency, providing a robust foundation for text-to-whisper research. Experimental results demonstrate that WhispSynth exhibits significantly higher quality than existing corpora. Moreover, our CosyWhisper, tuned with WhispSynth, achieves speech naturalness on par with ground-truth samples. The official implementation and related resources are available at https://github.com/tan90xx/cosywhisper.
Paper Structure (24 sections, 6 equations, 5 figures, 6 tables, 1 algorithm)

This paper contains 24 sections, 6 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Different Dynamic Range. Whispers exhibit a significantly lower sound pressure level compared to normal speech, even when the linguistic content is identical.
  • Figure 2: Visualization of the WhispSynth's generation pipeline. We apply CosyVoice3 and DDSP Model to generate pitch-free voice.
  • Figure 3: Results of the listening test. Whisper-likeness Mean Opinion Score (W-MOS) based on 20 participants visualized in a standard box plot.
  • Figure 4: User interface of the listening test.
  • Figure 5: Sample spectrograms from two datasets. (a-f) English samples from the CHAINs Dataset. (g-l) Chinese samples from the WhispNJU Dataset. (a,g) CosyWhisper; (b,h) CosyVoice3 Finetuned on WhispReal; (c,i) CosyVoice3 Repeat; (d,j) WhispSynth; (e,k) WhispReal; (f,l) Normal Speech.