Disentanglement in a GAN for Unconditional Speech Synthesis

Matthew Baas, Herman Kamper

TL;DR

ASGAN introduces a StyleGAN3-inspired unconditional speech synthesis model that learns a disentangled latent space to directly generate speech from noise. By incorporating anti-aliasing filters and adaptive discriminator updates, ASGAN achieves state-of-the-art or competitive quality on the Google Speech Commands SC09 digits dataset with faster inference than diffusion models. The disentangled latent space enables zero-shot downstream tasks such as voice conversion, speech enhancement, speaker verification, and keyword classification through simple linear latent operations, illustrating broad generalization without task-specific fine-tuning. Limitations include fixed utterance length and inversion quality, with future work aimed at scaling to longer sequences and improving latent-space inversion and control.

Abstract

Can we develop a model that can synthesize realistic speech directly from a latent space, without explicit conditioning? Despite several efforts over the last decade, previous adversarial and diffusion-based approaches still struggle to achieve this, even on small-vocabulary datasets. To address this, we propose AudioStyleGAN (ASGAN) -- a generative adversarial network for unconditional speech synthesis tailored to learn a disentangled latent space. Building upon the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer. To successfully train ASGAN, we introduce a number of new techniques, including a modification to adaptive discriminator augmentation which probabilistically skips discriminator updates. We apply it on the small-vocabulary Google Speech Commands digits dataset, where it achieves state-of-the-art results in unconditional speech synthesis. It is also substantially faster than existing top-performing diffusion models. We confirm that ASGAN's latent space is disentangled: we demonstrate how simple linear operations in the space can be used to perform several tasks unseen during training. Specifically, we perform evaluations in voice conversion, speech enhancement, speaker verification, and keyword classification. Our work indicates that GANs are still highly competitive in the unconditional speech synthesis landscape, and that disentangled latent spaces can be used to aid generalization to unseen tasks. Code, models, samples: https://github.com/RF5/simple-asgan/
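The abstract only states that the modified adaptive discriminator augmentation "probabilistically skips discriminator updates". One way this could look is sketched below, under stated assumptions: the overfitting signal, the `target` threshold, the `step` size, and all function names are hypothetical and are not taken from the paper or its code.

```python
import random


def update_skip_probability(p, overfit_signal, target=0.6, step=0.01):
    """Nudge the skip probability p toward less discriminator training
    when the discriminator looks overconfident, and back down otherwise.

    `overfit_signal`, `target`, and `step` are illustrative assumptions.
    """
    if overfit_signal > target:
        p += step
    else:
        p -= step
    # Keep p a valid probability.
    return min(1.0, max(0.0, p))


def should_skip_discriminator(p, rng):
    """Return True with probability p (rng.random() lies in [0, 1))."""
    return rng.random() < p


def simulate(num_steps, overfit_signal, seed=0):
    """Toy training loop that counts how many discriminator updates are skipped."""
    rng = random.Random(seed)
    p = 0.0
    skipped = 0
    for _ in range(num_steps):
        p = update_skip_probability(p, overfit_signal)
        if should_skip_discriminator(p, rng):
            skipped += 1
            continue
        # A real loop would run the discriminator update here.
    return p, skipped
```

With a persistently high overfitting signal, `p` climbs and more discriminator updates are skipped; with a low signal, `p` stays at zero and the discriminator is updated on every step.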

Paper Structure

This paper contains 37 sections, 1 equation, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: The ASGAN generator $G$ (left) and discriminator $D$ (right). FF, LPF, and Conv1D indicate Fourier feature (Karras et al., 2021), low-pass filter, and 1D convolution layers, respectively. The number of output features/channels is indicated above each linear and convolutional layer. Stacked blocks indicate a layer repeated sequentially.
  • Figure 2: Voice conversion interpolation in the $W$-latent space. Given a source (top left) and target utterance (top right), we smoothly convert the speaker from the source to the target by linearly interpolating the projected $\mathbf{w}$ vectors from $\mathbf{w}_1$ (source) to $\mathbf{w}_2$ (target) for use in the fine styles. The $W$-space interpolation is illustrated as a 2D linear discriminant analysis (LDA) decomposition of the SC09 test set.
  • Figure 3: Speech enhancement in the $W$-latent space. Given an input utterance (top left), we add noise to it several times and find the average direction of decreasing noise. Denoising or increasing noise is then performed by traversing the latent space by varying amounts along the denoising direction $\boldsymbol{\delta}$ (top). The $W$-space interpolation is illustrated as a 2D partial least-squares regression (PLS2) of all the latent points from the SC09 test set, fit on eMOS value as an indication of noise level. Latent points corresponding to the noise-added utterances used to estimate $\boldsymbol{\delta}$ are shown in red ($\color{red} \blacklozenge$).
  • Figure 4: Keyword classification and speech editing in the $W$-latent space. Given a source (top left) and target utterance (top right), we smoothly convert the content from the source to the target by linearly interpolating the projected $\mathbf{w}$ vectors from $\mathbf{w}_1$ (source) to $\mathbf{w}_2$ (target) for use in the coarse styles. The $W$-space interpolation is illustrated as a 2D LDA decomposition on the SC09 test set fit on spoken digits, showing its usefulness for keyword classification.
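
The linear latent operations described in the captions of Figures 2 to 4 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: latent vectors are plain Python lists, all function names are hypothetical, the denoising direction is assumed to be the mean of clean-minus-noisy differences, and coarse styles are assumed to feed the earliest generator layers (as in StyleGAN-style models).

```python
def lerp(w1, w2, t):
    """Linearly interpolate from w1 (t=0) to w2 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(w1, w2)]


def denoise_direction(w_clean, w_noisy_list):
    """Average direction pointing from noisy latents back to the clean one.

    Assumes each entry of w_noisy_list is the projected latent of the same
    utterance with noise added.
    """
    n = len(w_noisy_list)
    return [
        sum(w_clean[i] - w_noisy[i] for w_noisy in w_noisy_list) / n
        for i in range(len(w_clean))
    ]


def traverse(w, delta, alpha):
    """Move w by alpha along delta (alpha > 0 denoises, alpha < 0 adds noise)."""
    return [a + alpha * d for a, d in zip(w, delta)]


def mix_styles(w_source, w_edit, num_layers, num_coarse, edit="fine"):
    """Build a per-layer style list, swapping in w_edit for one group of layers.

    edit="fine" replaces the later layers (speaker change in Figure 2);
    edit="coarse" replaces the first num_coarse layers (content change in Figure 4).
    """
    if edit not in ("fine", "coarse"):
        raise ValueError("edit must be 'fine' or 'coarse'")
    styles = []
    for layer in range(num_layers):
        is_coarse = layer < num_coarse
        use_edit = (edit == "coarse") == is_coarse
        styles.append(list(w_edit if use_edit else w_source))
    return styles
```

For voice conversion, one would compute `lerp(w1, w2, t)` for a chosen `t` and pass it as `w_edit` with `edit="fine"`; for content editing, the same interpolated vector would be used with `edit="coarse"`. The layer count and the coarse/fine split are model-specific and are left as parameters here.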