DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
Yinghao Aaron Li, Rithesh Kumar, Zeyu Jin
TL;DR
DMOSpeech introduces a distilled diffusion-based TTS model that enables direct end-to-end optimization of differentiable proxies for perceptual metrics: a CTC loss targets word error rate (WER), and a speaker verification (SV) loss targets speaker similarity. By employing Distribution Matching Distillation (DMD2) and a conditional multimodal discriminator, it reduces sampling to 4 steps while maintaining or improving quality, and it outperforms the teacher on several metrics. The optimized metrics correlate strongly with human judgments, and the model delivers over 13× faster inference along with improved naturalness, intelligibility, and speaker similarity on LibriLight data. The paper also discusses mode shrinkage and ethical considerations for high-similarity synthetic speech, and points to future work on reinforcement learning from human feedback and differentiable metric design.
Abstract
Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are computationally intensive, and previous distillation attempts have shown consistent quality degradation. Moreover, existing TTS approaches are limited by non-differentiable components or iterative sampling that prevent true end-to-end optimization with perceptual metrics. We introduce DMOSpeech, a distilled diffusion-based TTS model that uniquely achieves both faster inference and superior performance compared to its teacher model. By enabling direct gradient pathways to all model components, we demonstrate the first successful end-to-end optimization of differentiable metrics in TTS, incorporating Connectionist Temporal Classification (CTC) loss and Speaker Verification (SV) loss. Our comprehensive experiments, validated through extensive human evaluation, show significant improvements in naturalness, intelligibility, and speaker similarity while reducing inference time by orders of magnitude. This work establishes a new framework for aligning speech synthesis with human auditory preferences through direct metric optimization. The audio samples are available at https://dmospeech.github.io/.
