DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis

Yinghao Aaron Li, Rithesh Kumar, Zeyu Jin

TL;DR

DMOSpeech is a distilled diffusion-based TTS model that enables direct end-to-end optimization of word error rate (WER) and speaker similarity through differentiable surrogate losses: a CTC loss and a speaker verification (SV) loss, respectively. Using Distribution Matching Distillation (DMD2) with a conditional multimodal discriminator, it reduces sampling to 4 steps while maintaining or improving quality, outperforming the teacher on several metrics. The optimized metrics correlate with human judgments, and the model delivers over 13× faster inference along with improved naturalness, intelligibility, and speaker similarity on LibriLight data. The paper also discusses mode shrinkage and ethical considerations for high-similarity synthetic speech, and points to future work on reinforcement learning from human feedback and differentiable metric design.

Abstract

Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are computationally intensive, and previous distillation attempts have shown consistent quality degradation. Moreover, existing TTS approaches are limited by non-differentiable components or iterative sampling that prevent true end-to-end optimization with perceptual metrics. We introduce DMOSpeech, a distilled diffusion-based TTS model that uniquely achieves both faster inference and superior performance compared to its teacher model. By enabling direct gradient pathways to all model components, we demonstrate the first successful end-to-end optimization of differentiable metrics in TTS, incorporating Connectionist Temporal Classification (CTC) loss and Speaker Verification (SV) loss. Our comprehensive experiments, validated through extensive human evaluation, show significant improvements in naturalness, intelligibility, and speaker similarity while reducing inference time by orders of magnitude. This work establishes a new framework for aligning speech synthesis with human auditory preferences through direct metric optimization. The audio samples are available at https://dmospeech.github.io/.
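To make the two metrics named in the abstract concrete, here is a minimal sketch of how they are commonly computed at evaluation time: WER as word-level edit distance divided by the reference length, and speaker similarity (SIM) as the cosine similarity between two speaker embeddings. The function names, the toy transcripts, the toy embedding vectors, and the `1 - SIM` form of the SV loss are all illustrative assumptions and not the authors' implementation. WER itself is not differentiable, which is why training relies on a CTC loss as its surrogate.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sv_loss(prompt_emb, generated_emb) -> float:
    """Assumed loss form (hypothetical): one minus the cosine similarity."""
    return 1.0 - cosine_similarity(prompt_emb, generated_emb)


# Toy inputs, made up for illustration only.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
sim = cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 1.0])
print(f"WER = {wer:.3f}, SIM = {sim:.3f}")
```

In practice the transcripts would come from an ASR model and the embeddings from a speaker verification model; the sketch only shows the arithmetic applied to their outputs.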

Paper Structure

This paper contains 29 sections, 33 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of the DMOSpeech framework. The framework consists of inference and three main components for training: (1) Inference (upper left): A few-step distilled generator $G_\theta$ synthesizes speech directly from noise, conditioned on the text and speaker prompt (red arrow). (2) Distribution Matching Distillation (upper right): Gradient computation for DMD loss where the student score model $g_{\boldsymbol{\psi}}$ matches the teacher score model $f_{\boldsymbol{\phi}}$ to align the distribution of student generator $G_\theta$ with the teacher distribution (purple arrow). (3) Multi-Modal Adversarial Training (lower right): The discriminator $D$ uses stacked features from the student score model to distinguish between real and synthesized noisy latents conditioned on both text and prompt (yellow arrows). (4) Direct Metric Optimization (lower left): Direct metric optimization for word error rate (WER) via CTC loss (pink arrow) and speaker embedding cosine similarity (SIM) via SV loss (blue arrow).
  • Figure 2: Illustration of mode shrinkage in terms of pitch. Speech with the same text and prompt was synthesized 50 times, and the frame-level F0 values are shown as histograms and kernel density estimates. The red dashed line represents the mean F0 value of the prompt. In both examples, the student's distribution shifts toward the most likely region, centering around the prompt's mean value.
  • Figure 3: Scatter plot of human-rated voice similarity (SMOS-V) versus speaker embedding cosine similarity (SIM) at the utterance level. The correlation coefficient is 0.55.
  • Figure 4: Two examples of mode coverage in the continuation task on the LibriSpeech test-clean subset. The model continues from a prompt using exactly the same text as the ground truth. This task synthesizes speech with varying prompts and texts from the same speaker, allowing us to compare mode coverage without fixing the text and prompt. The student behaves very similarly to the teacher and shows minimal mode shrinkage. The misalignment in energy between the ground truth and our models is caused by normalization during data pre-processing, where the audio is normalized to the range -1 to 1 in amplitude, giving the generated samples a different amplitude range.
  • Figure 5: Top: Scatter plots showing the relationship between human-rated naturalness (MOS-N) and sound quality (MOS-Q) versus word error rate (WER). The correlation coefficients are -0.16 for both, indicating a weak negative correlation ($p \ll 0.01$). Bottom: Scatter plots of human-rated voice similarity (SMOS-V) and style similarity (SMOS-S) versus speaker embedding cosine similarity (SIM). The correlation coefficients are 0.55 and 0.50, reflecting a strong positive correlation ($p \ll 0.01$). These plots demonstrate how objective evaluations (WER and SIM) align with subjective human ratings.
  • ...and 1 more figure
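The correlation coefficients quoted in the captions of Figures 3 and 5 relate utterance-level objective scores to human ratings. As a minimal sketch, assuming the reported statistic is a Pearson correlation (the captions do not name it) and using made-up toy values rather than the paper's data, such a coefficient could be computed as follows. The names `pearson_r`, `sim_scores`, and `smos_ratings` are hypothetical.

```python
import math


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical utterance-level values, not taken from the paper.
sim_scores = [0.52, 0.61, 0.58, 0.70, 0.66, 0.75]  # speaker embedding cosine similarity
smos_ratings = [3.0, 3.5, 4.0, 3.5, 4.5, 4.0]      # human-rated voice similarity
r = pearson_r(sim_scores, smos_ratings)
print(f"r = {r:.2f}")
```

The same function applies to the WER panels by passing per-utterance WER values and MOS ratings; a negative result there would indicate that higher error rates accompany lower ratings.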