Fast, High-Quality and Parameter-Efficient Articulatory Synthesis using Differentiable DSP

Yisi Liu, Bohan Yu, Drake Lin, Peter Wu, Cheol Jun Cho, Gopala Krishna Anumanchipalli

TL;DR

This work addresses efficient articulatory speech synthesis by integrating EMA-based features with differentiable DSP (DDSP). It introduces a parameter-efficient DDSP vocoder built on a Harmonic-plus-Noise model: an encoder outputs control signals for a harmonic oscillator and a filtered-noise generator, followed by a post-convolution module, and training uses multi-scale, multi-resolution losses. On EMA data the model achieves a WER of $6.67\%$ and a MOS of $3.74$, runs $4.9\times$ faster than the baseline on CPU, and reaches comparable quality with only $0.4$M parameters versus the $9$M-parameter SOTA baseline. These results show strong synthesis quality together with substantial parameter and speed advantages, which makes edge deployment practical and paves the way for future multi-speaker extensions.

Abstract

Articulatory trajectories such as electromagnetic articulography (EMA) provide a low-dimensional representation of the vocal tract filter and have been used as natural, grounded features for speech synthesis. Differentiable digital signal processing (DDSP) is a parameter-efficient framework for audio synthesis. Integrating low-dimensional EMA features with DDSP can therefore significantly improve the computational efficiency of speech synthesis. In this paper, we propose a fast, high-quality, and parameter-efficient DDSP articulatory vocoder that synthesizes speech from EMA, F0, and loudness. We incorporate several techniques to solve the harmonics/noise imbalance problem and add a multi-resolution adversarial loss for better synthesis quality. Our model achieves a transcription word error rate (WER) of 6.67% and a mean opinion score (MOS) of 3.74, improvements of 1.63% and 0.16, respectively, over the state-of-the-art (SOTA) baseline. Our DDSP vocoder is 4.9x faster than the baseline on CPU during inference and can generate speech of comparable quality with only 0.4M parameters, in contrast to the 9M parameters required by the SOTA.
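For readers unfamiliar with the WER metric quoted above, the following is a minimal sketch of the standard definition: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the lowercase-and-split normalization are assumptions made for illustration; the speech recognizer and text normalization used in the paper are not described in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: normalization here is just lowercasing and
    whitespace splitting, which may differ from the paper's pipeline.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words (row-by-row dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "Michael Ashcroft is a British citizen"
    hyp = "Michael Ashcroft is British citizens"
    print(f"WER = {word_error_rate(ref, hyp):.3f}")
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.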

Paper Structure

This paper contains 28 sections, 8 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Overall model architecture. Only the green modules are trainable. The gray blocks are control signals. F0 is pitch, L is loudness, a[n] is the global amplitude, c[n] is the harmonic distribution, and H[n] is the filter frequency response.
  • Figure 2: Encoder architecture. Loudness is fed into the loudness FiLM module as the condition.
  • Figure 3: The spectrograms of the ground truth speech, synthesized speech without GAN, and synthesized speech with GAN. As shown in the boxed regions, without GAN the spectrogram energy bands are averaged out, while with GAN the finer structures are better preserved.
  • Figure 4: WER and UTMOS against model size.
  • Figure 5: Decomposed spectrograms of the utterance "Michael Ashcroft is a British citizen."
  • ...and 2 more figures
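The control signals named in the Figure 1 caption (F0, a[n], c[n], H[n]) can be illustrated with a minimal harmonic-plus-noise sketch. This is not the authors' implementation: the function names, the per-sample control rate, the naive per-frame DFT filter without overlap-add, and the omission of the encoder, loudness conditioning, and post-convolution module are all simplifying assumptions made here for illustration.

```python
import cmath
import math
import random


def harmonic_oscillator(f0, a, c, sample_rate):
    """Sum of sinusoids at integer multiples of a time-varying pitch.

    f0: per-sample pitch in Hz; a: per-sample global amplitude;
    c: per-sample list of harmonic weights (the harmonic distribution).
    """
    out = []
    phase = 0.0
    nyquist = sample_rate / 2.0
    for n in range(len(f0)):
        # Accumulate instantaneous phase so pitch changes stay continuous.
        phase += 2.0 * math.pi * f0[n] / sample_rate
        sample = 0.0
        for k, weight in enumerate(c[n], start=1):
            if k * f0[n] < nyquist:  # skip harmonics that would alias
                sample += weight * math.sin(k * phase)
        out.append(a[n] * sample)
    return out


def filtered_noise(H, frame_size, rng):
    """Shape white noise frame by frame with a magnitude response.

    H: list of frames, each holding frame_size // 2 + 1 non-negative
    magnitudes (bins 0..Nyquist). frame_size is assumed to be even.
    Uses a naive O(N^2) DFT and circular filtering, for clarity only.
    """
    out = []
    for mags in H:
        noise = [rng.uniform(-1.0, 1.0) for _ in range(frame_size)]
        spectrum = [
            sum(noise[t] * cmath.exp(-2j * math.pi * k * t / frame_size)
                for t in range(frame_size))
            for k in range(frame_size)
        ]
        # Mirror the one-sided response so the output stays real-valued.
        shaped = [spectrum[k] * mags[min(k, frame_size - k)]
                  for k in range(frame_size)]
        for t in range(frame_size):
            value = sum(shaped[k] * cmath.exp(2j * math.pi * k * t / frame_size)
                        for k in range(frame_size))
            out.append(value.real / frame_size)
    return out


def synthesize(f0, a, c, H, frame_size, sample_rate, seed=0):
    """Add the harmonic and noise branches (truncates to the shorter one)."""
    harmonic = harmonic_oscillator(f0, a, c, sample_rate)
    noise = filtered_noise(H, frame_size, random.Random(seed))
    return [h + z for h, z in zip(harmonic, noise)]


if __name__ == "__main__":
    n, sr, frame = 64, 16000, 16
    audio = synthesize(
        f0=[200.0] * n,
        a=[0.5] * n,
        c=[[0.6, 0.3, 0.1]] * n,                     # three harmonics
        H=[[0.05] * (frame // 2 + 1)] * (n // frame),  # flat, quiet noise
        frame_size=frame,
        sample_rate=sr,
    )
    print(len(audio))
```

In a trained model, an encoder would predict these controls from EMA, F0, and loudness; here they are hand-set constants so the two branches can be inspected in isolation.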