Table of Contents
Fetching ...

VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space

Armani Rodriguez, Silvija Kokalj-Filipovic

TL;DR

This paper introduces VQalAttent, a lightweight model designed to generate fake speech with tunable performance and interpretability, and demonstrates VQalAttent's capacity to generate intelligible speech samples with limited computational resources, while the modularity and transparency of the training pipeline helps easily correlate the analytics with modular modifications, hence providing insights for the more complex models.

Abstract

Generating high-quality speech efficiently remains a key challenge for generative models in speech synthesis. This paper introduces VQalAttent, a lightweight model designed to generate fake speech with tunable performance and interpretability. Leveraging the AudioMNIST dataset, consisting of human utterances of decimal digits (0-9), our method employs a two-step architecture: first, a scalable vector quantized autoencoder (VQ-VAE) that compresses audio spectrograms into discrete latent representations, and second, a decoder-only transformer that learns the probability model of these latents. Trained transformer generates similar latent sequences, convertible to audio spectrograms by the VQ-VAE decoder, from which we generate fake utterances. Interpreting statistical and perceptual quality of the fakes, depending on the dimension and the extrinsic information of the latent space, enables guided improvements in larger, commercial generative models. As a valuable tool for understanding and refining audio synthesis, our results demonstrate VQalAttent's capacity to generate intelligible speech samples with limited computational resources, while the modularity and transparency of the training pipeline helps easily correlate the analytics with modular modifications, hence providing insights for the more complex models.

VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space

TL;DR

This paper introduces VQalAttent, a lightweight model designed to generate fake speech with tunable performance and interpretability, and demonstrates VQalAttent's capacity to generate intelligible speech samples with limited computational resources, while the modularity and transparency of the training pipeline helps easily correlate the analytics with modular modifications, hence providing insights for the more complex models.

Abstract

Generating high-quality speech efficiently remains a key challenge for generative models in speech synthesis. This paper introduces VQalAttent, a lightweight model designed to generate fake speech with tunable performance and interpretability. Leveraging the AudioMNIST dataset, consisting of human utterances of decimal digits (0-9), our method employs a two-step architecture: first, a scalable vector quantized autoencoder (VQ-VAE) that compresses audio spectrograms into discrete latent representations, and second, a decoder-only transformer that learns the probability model of these latents. Trained transformer generates similar latent sequences, convertible to audio spectrograms by the VQ-VAE decoder, from which we generate fake utterances. Interpreting statistical and perceptual quality of the fakes, depending on the dimension and the extrinsic information of the latent space, enables guided improvements in larger, commercial generative models. As a valuable tool for understanding and refining audio synthesis, our results demonstrate VQalAttent's capacity to generate intelligible speech samples with limited computational resources, while the modularity and transparency of the training pipeline helps easily correlate the analytics with modular modifications, hence providing insights for the more complex models.

Paper Structure

This paper contains 12 sections, 2 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Architecture of VQVAE. The dimension and size of the codebook match the configuration of VQalAttent - Case 2.
  • Figure 2: Architecture of VQalAttent and data flow in the training and inference phase: training latent sequences are denoted $Z_Q,$ training output of the transformer by $\widehat{Z_Q}$ and the autoregressively generated fake by $\widecheck{Z_Q}.$
  • Figure 3: Architecture of the classifier. Convolutional layers are specified by the following format: Conv2D(input shape, output shape, stride), fully connected layers are specified by the following format: MLP(number of neurons).
  • Figure 4: Deep fakes created by $T^C_{1,2}$ for the class of $1$.
  • Figure 5: Original and VQ-VAE reconstructed spectrograms.
  • ...and 1 more figures