WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching

Tianze Luo, Xingchen Miao, Wenbo Duan

TL;DR

WaveFM is a high-fidelity vocoder that reparameterizes flow matching for mel-conditioned speech synthesis. It uses a mel-conditioned diagonal prior in place of a standard Gaussian to reduce transport cost and improve stability, and it trains the network to predict waveforms directly through a reparameterized objective. The training objective is augmented with a multi-resolution STFT-based loss with phase, gradient, and Laplacian terms, and a tailored consistency distillation scheme enables one-step waveform generation without significantly harming quality. The network is an asymmetric U-Net with snake activations and large receptive fields, conditioned on the mel-spectrogram. Empirical results on LibriTTS and MUSDB18-HQ show that WaveFM surpasses prior diffusion vocoders in both sample quality and inference speed, and that it generalizes to out-of-distribution musical data.

Abstract

Flow matching offers a robust and stable approach to training diffusion models. However, directly applying flow matching to neural vocoders can result in subpar audio quality. In this work, we present WaveFM, a reparameterized flow matching model for mel-spectrogram conditioned speech synthesis, designed to enhance both sample quality and generation speed for diffusion vocoders. Since mel-spectrograms represent the energy distribution of waveforms, WaveFM adopts a mel-conditioned prior distribution instead of a standard Gaussian prior to minimize unnecessary transport costs during synthesis. Moreover, while most diffusion vocoders rely on a single loss function, we argue that incorporating auxiliary losses, including a refined multi-resolution STFT loss, can further improve audio quality. To speed up inference without significantly degrading sample quality, we introduce a tailored consistency distillation method for WaveFM. Experimental results demonstrate that our model achieves superior performance in both quality and efficiency compared to previous diffusion vocoders, while enabling waveform generation in a single inference step.
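
The two central ideas of the abstract, an energy-matched prior and a network that predicts the waveform directly, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the way a per-sample standard deviation is derived from per-frame energy, the straight-line interpolation path, and the toy signal are all assumptions made for illustration. The sketch shows why a prior whose scale follows the signal energy lies closer to the target than a standard Gaussian, and how a waveform prediction can be converted into a velocity on a straight path.

```python
import numpy as np

def frame_energy_to_sigma(frame_energy, hop, sigma_min=1e-3):
    """Hypothetical helper: turn per-frame energy into a per-sample std.

    Each frame's value is repeated `hop` times, and a small floor keeps
    the prior from collapsing in silent regions.
    """
    sigma = np.sqrt(np.maximum(frame_energy, 0.0))
    return np.maximum(np.repeat(sigma, hop), sigma_min)

def sample_prior(sigma, rng):
    """Draw x0 ~ N(0, diag(sigma^2)), i.e. a diagonal Gaussian prior."""
    return sigma * rng.standard_normal(sigma.shape)

def interpolate(x0, x1, t):
    """Straight path between prior sample x0 (t=0) and waveform x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def velocity_from_waveform_prediction(x1_pred, x_t, t):
    """Convert a direct waveform prediction into a velocity.

    On the straight path above, x1 - x_t = (1 - t) * (x1 - x0),
    so the velocity x1 - x0 equals (x1 - x_t) / (1 - t) for t < 1.
    """
    return (x1_pred - x_t) / (1.0 - t)

# Toy setup (assumed values): 32 frames, hop of 256 samples, rising energy.
rng = np.random.default_rng(0)
hop, n_frames, sample_rate = 256, 32, 22050
frame_energy = np.linspace(0.0, 0.04, n_frames)
sigma = frame_energy_to_sigma(frame_energy, hop)

# A sine whose amplitude envelope follows the same energy profile.
n = np.arange(sigma.size)
x1 = np.sqrt(2.0) * sigma * np.sin(2.0 * np.pi * 220.0 * n / sample_rate)

# Compare the mean squared distance from each prior sample to the target.
x0_mel = sample_prior(sigma, rng)
x0_std = rng.standard_normal(x1.shape)
cost_mel = float(np.mean((x1 - x0_mel) ** 2))
cost_std = float(np.mean((x1 - x0_std) ** 2))
print("transport cost, energy-matched prior:", cost_mel)
print("transport cost, standard Gaussian:   ", cost_std)

# With a perfect waveform prediction, the recovered velocity is x1 - x0.
t = 0.3
x_t = interpolate(x0_mel, x1, t)
v = velocity_from_waveform_prediction(x1, x_t, t)
```

In this toy case the energy-matched prior gives a much smaller squared distance to the target, which is the intuition behind replacing the standard Gaussian. In a real vocoder the scale would be derived from the mel-spectrogram itself and the waveform prediction would come from the trained network.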

Paper Structure

This paper contains 23 sections, 38 equations, 3 figures, 4 tables, 2 algorithms.

Figures (3)

  • Figure 1: Spectrograms of clean audio and of audio generated by WaveFM-6 Steps using the original STFT loss and our proposed STFT loss, from left to right.
  • Figure 2: Network architecture. Conv1d and ConvTranspose1d are set with parameters (output channel, kernel width, dilation, padding).
  • Figure 3: Spectrograms of a music clip (Ground Truth, WaveFM-6 Steps, PriorGrad-6 Steps, BigVGAN-base)