Ultra-lightweight Neural Differential DSP Vocoder For High Quality Speech Synthesis

Prabhav Agrawal, Thilo Koehler, Zhiping Xiu, Prashant Serai, Qing He

TL;DR

This work tackles high-quality on-device TTS by proposing an ultra-lightweight DDSP vocoder that jointly optimizes a neural acoustic model with a DSP-based vocoder. By separating excitation from vocal-tract filtering and learning spectral representations directly from audio, the approach achieves MOS scores comparable to neural vocoders at far lower compute (15 MFLOPS) and runs much faster than real time on modest hardware ($\text{RTF}_{\text{vocoder}} = 0.003$, $\text{RTF}_{\text{overall}} = 0.044$ on a 2 GHz CPU). The DDSP framework uses a source-filter DSP vocoder driven by learned $F0$, periodicity $P$, and vocal-tract filter $V$, trained end to end with a combination of reference MSE, multi-window STFT, and adversarial losses. Empirical results show large FLOPS savings and faster inference than MB-MelGAN while maintaining high speech naturalness, which makes the approach practical for private, on-device TTS on wearables and low-end devices.

Abstract

Neural vocoders model the raw audio waveform and synthesize high-quality audio, but even highly efficient ones, like MB-MelGAN and LPCNet, fail to run in real time on a low-end device such as smart glasses. A pure digital signal processing (DSP) based vocoder can be implemented via lightweight fast Fourier transforms (FFT) and is therefore an order of magnitude faster than any neural vocoder. However, a DSP vocoder often yields lower audio quality because it consumes over-smoothed acoustic model predictions of approximate representations of the vocal tract. In this paper, we propose an ultra-lightweight differential DSP (DDSP) vocoder that jointly optimizes an acoustic model with a DSP vocoder and learns without an extracted spectral feature for the vocal tract. The model achieves audio quality comparable to neural vocoders, with a high average MOS of 4.36, while being as efficient as a DSP vocoder. Our C++ implementation, without any hardware-specific optimization, runs at 15 MFLOPS, requires 340 times fewer FLOPS than MB-MelGAN, and achieves a vocoder-only RTF of 0.003 and an overall RTF of 0.044 while running single-threaded on a 2 GHz Intel Xeon CPU.
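
To make the reported real-time factors concrete, the short sketch below converts an RTF into wall-clock synthesis time. It assumes the usual definition, RTF = processing time / duration of the audio produced, which this section does not spell out. The function names are illustrative and do not come from the paper.

```python
# Assumed definition: RTF = processing time / duration of audio produced.
# Function names are hypothetical helpers, not part of the paper's code.

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return the real-time factor for one synthesis run."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_time_ms(rtf: float, audio_seconds: float) -> float:
    """Wall-clock time (in milliseconds) needed to synthesize the audio."""
    return rtf * audio_seconds * 1000.0


# RTF values as reported in the abstract.
VOCODER_RTF = 0.003
OVERALL_RTF = 0.044

if __name__ == "__main__":
    for name, rtf in (("vocoder only", VOCODER_RTF), ("overall", OVERALL_RTF)):
        ms = processing_time_ms(rtf, audio_seconds=1.0)
        print(f"{name}: {ms:.1f} ms per second of audio")
```

Under that definition, any RTF below 1 means synthesis runs faster than playback, so both reported values leave substantial headroom on the stated CPU.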

Paper Structure (13 sections, 6 equations, 3 figures, 4 tables)

This paper contains 13 sections, 6 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Our on-device TTS pipeline: The frontend extracts linguistic features, and the prosody model consumes them to output phone-level F0 and duration. The acoustic model then takes the upsampled features and predicts frame-level acoustic features, which the DSP vocoder converts to an audio waveform. In the DDSP vocoder, the acoustic model and DSP vocoder are combined into a single module.
  • Figure 2: Source-Filter Model: The speech signal is generated by mixing an impulse train and white noise according to periodicity, then applying a filter that represents the vocal tract and lip radiation.
  • Figure 3: 80-dim $lmel_{psync}$ prediction of our DSP Vocoder Adv (see the experimental setup section) vs. the learned spectral feature from the DDSP vocoder, which shows sharper formants and plosives. Both are shown in inverted greyscale.
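
The source-filter model in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation. It uses a constant F0 rounded to a whole-sample pitch period, a linear mix `P * pulses + (1 - P) * noise` (the exact mixing rule is not given in this section), and a short time-domain FIR filter as a stand-in for the vocal-tract filter, which the abstract says can be implemented with FFTs. All function names are hypothetical.

```python
import random


def impulse_train(f0_hz: float, sample_rate: int, n_samples: int) -> list:
    """Unit impulses one pitch period apart (period rounded to whole samples)."""
    period = max(1, round(sample_rate / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def mix_excitation(pulses: list, noise: list, periodicity: float) -> list:
    """Assumed linear mix: periodicity=1 is fully voiced, 0 is pure noise."""
    return [periodicity * p + (1.0 - periodicity) * w
            for p, w in zip(pulses, noise)]


def fir_filter(signal: list, taps: list) -> list:
    """Causal FIR convolution; a simple stand-in for the vocal-tract filter."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out


def synthesize(f0_hz: float, periodicity: float, taps: list,
               sample_rate: int = 16000, n_samples: int = 1600,
               seed: int = 0) -> list:
    """Excitation (pulses mixed with white noise) followed by the filter."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    pulses = impulse_train(f0_hz, sample_rate, n_samples)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    excitation = mix_excitation(pulses, noise, periodicity)
    return fir_filter(excitation, taps)


if __name__ == "__main__":
    voiced = synthesize(f0_hz=100.0, periodicity=1.0, taps=[1.0, 0.5])
    print(voiced[:3], len(voiced))
```

In the DDSP vocoder described above, F0, periodicity, and the filter would be predicted per frame by the acoustic model rather than held fixed as they are here.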