Ultra-lightweight Neural Differential DSP Vocoder For High Quality Speech Synthesis

Prabhav Agrawal; Thilo Koehler; Zhiping Xiu; Prashant Serai; Qing He

TL;DR

This work targets high-quality on-device TTS with an ultra-lightweight DDSP vocoder that jointly optimizes a neural acoustic model with a DSP-based vocoder. By separating excitation from vocal-tract filtering and learning the spectral representation directly from audio, the approach reaches MOS scores comparable to neural vocoders at far lower compute (15 MFLOPS) and runs much faster than real time on modest hardware ($\text{RTF}_{\text{vocoder}} = 0.003$, $\text{RTF}_{\text{overall}} = 0.044$ on a 2 GHz CPU). The DDSP framework uses a source-filter DSP vocoder driven by learned $F0$, periodicity $P$, and vocal-tract filter $V$, trained end to end with a combination of reference MSE, multi-window STFT, and adversarial losses. Empirical results show significant FLOPS savings and faster inference than MB-MelGAN while maintaining high speech naturalness, which makes the approach practical for private, on-device TTS on wearables and low-end devices.

Abstract

Neural vocoders model the raw audio waveform and synthesize high-quality audio, but even the highly efficient ones, like MB-MelGAN and LPCNet, fail to run in real time on a low-end device such as smart glasses. A pure digital signal processing (DSP) based vocoder can be implemented via lightweight fast Fourier transforms (FFT) and is therefore an order of magnitude faster than any neural vocoder. However, a DSP vocoder often yields lower audio quality because it consumes over-smoothed acoustic model predictions of approximate representations of the vocal tract. In this paper, we propose an ultra-lightweight differential DSP (DDSP) vocoder that uses a jointly optimized acoustic model with a DSP vocoder, and learns without an extracted spectral feature for the vocal tract. The model achieves audio quality comparable to neural vocoders, with a high average MOS of 4.36, while being as efficient as a DSP vocoder. Our C++ implementation, without any hardware-specific optimization, requires 15 MFLOPS, 340 times fewer FLOPS than MB-MelGAN, and achieves a vocoder-only RTF of 0.003 and an overall RTF of 0.044 while running single-threaded on a 2 GHz Intel Xeon CPU.
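To make the efficiency figures in the abstract concrete, the short sketch below converts the quoted real-time factors into processing time per second of audio and works out the compute budget implied for MB-MelGAN by the 340x ratio. It assumes the usual definition of RTF (compute time divided by the duration of the audio produced); the function and variable names are illustrative, and the MB-MelGAN number is derived from the quoted ratio rather than taken from the source.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = compute time / duration of audio produced (assumed definition).

    Values below 1.0 mean synthesis runs faster than real time.
    """
    return processing_seconds / audio_seconds


def processing_ms(rtf: float, audio_seconds: float) -> float:
    """Compute time in milliseconds needed to synthesize `audio_seconds` of audio."""
    return rtf * audio_seconds * 1000.0


# Figures quoted in the abstract.
VOCODER_RTF = 0.003
OVERALL_RTF = 0.044
DDSP_MFLOPS = 15.0
FLOPS_RATIO_VS_MB_MELGAN = 340.0

# Time spent per second of synthesized speech on the quoted 2 GHz CPU.
vocoder_ms = processing_ms(VOCODER_RTF, 1.0)   # about 3 ms for the vocoder alone
overall_ms = processing_ms(OVERALL_RTF, 1.0)   # about 44 ms for the whole pipeline

# Compute implied for MB-MelGAN by the 340x ratio (derived, not quoted).
implied_mb_melgan_gflops = DDSP_MFLOPS * FLOPS_RATIO_VS_MB_MELGAN / 1000.0  # about 5.1

print(f"vocoder: {vocoder_ms:.1f} ms, overall: {overall_ms:.1f} ms per second of audio")
print(f"implied MB-MelGAN compute: {implied_mb_melgan_gflops:.1f} GFLOPS")
```

Read this way, the vocoder accounts for only a small share of the overall RTF, so most of the remaining compute sits in the other pipeline stages.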

Paper Structure (13 sections, 6 equations, 3 figures, 4 tables)

Introduction
Proposed On-device TTS System
Frontend Components
DDSP Vocoder
DSP Vocoder
Acoustic Model
Joint Modeling via DDSP
Training
Results
Experimental setup for comparison
Speech Synthesis Quality Evaluation
Model Complexity
Conclusion

Figures (3)

Figure 1: Our on-device TTS pipeline: The frontend extracts linguistic features, and the prosody model consumes them to output phone-level F0 and duration. The acoustic model then takes the upsampled features and predicts frame-level acoustic features, which the DSP vocoder converts to an audio waveform. In the DDSP vocoder, the acoustic model and DSP vocoder are combined into a single module.
Figure 2: Source-Filter Model: The speech signal is generated by mixing an impulse train and white noise according to periodicity, followed by a filter representing the vocal tract and lip radiation.
Figure 3: 80-dim $\text{lmel}_{\text{psync}}$ prediction of our DSP Vocoder Adv (see the section "Experimental setup for comparison") versus the spectral feature learned by the DDSP vocoder, which shows sharper formants and plosives. Both are shown in inverted grayscale.
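The source-filter model described in the Figure 2 caption can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it mixes an impulse train and white noise according to a periodicity value and then applies a filter standing in for the vocal tract. The linear mixing rule, the time-domain FIR filter (the abstract indicates that a DSP vocoder can be implemented with FFTs), the sample rate, and all names are assumptions made for the example.

```python
import random


def source_filter_frame(f0_hz, periodicity, fir_taps, n_samples,
                        sample_rate=24000, seed=0):
    """Synthesize samples with a toy source-filter model.

    f0_hz:       fundamental frequency of the impulse train
    periodicity: 0.0 (pure noise) to 1.0 (pure impulse train); linear mix is an assumption
    fir_taps:    hypothetical FIR coefficients standing in for the vocal-tract filter
    """
    rng = random.Random(seed)

    # Voiced source: one unit impulse every pitch period.
    period = sample_rate / f0_hz
    impulses = [0.0] * n_samples
    next_pulse = 0.0
    while next_pulse < n_samples:
        impulses[int(next_pulse)] = 1.0
        next_pulse += period

    # Unvoiced source: white Gaussian noise.
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

    # Mix the two sources according to periodicity.
    excitation = [periodicity * p + (1.0 - periodicity) * n
                  for p, n in zip(impulses, noise)]

    # Filter: direct-form FIR convolution of the excitation with the taps.
    output = []
    for i in range(n_samples):
        acc = 0.0
        for k, tap in enumerate(fir_taps):
            if i - k >= 0:
                acc += tap * excitation[i - k]
        output.append(acc)
    return output


# Fully periodic excitation at 100 Hz, smoothed by a two-tap averaging filter.
voiced = source_filter_frame(f0_hz=100.0, periodicity=1.0,
                             fir_taps=[0.5, 0.5], n_samples=50, sample_rate=1000)
```

In the DDSP setting summarized above, the quantities passed in by hand here ($F0$, periodicity $P$, and the vocal-tract filter $V$) are predicted by the acoustic model and trained jointly with the vocoder.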
