FreGrad: Lightweight and Fast Frequency-aware Diffusion Vocoder

Tan Dat Nguyen, Ji-Hoon Kim, Youngjoon Jang, Jaehun Kim, Joon Son Chung

TL;DR

FreGrad tackles the computational burden of diffusion vocoders by operating in a compact, wavelet-based feature space. It introduces a frequency-aware dilated convolution (Freq-DConv) and a bag of tricks, including separate priors per wavelet band, a zero-SNR noise schedule, and a multi-resolution STFT magnitude loss, to preserve spectral fidelity with few diffusion steps. Empirically, FreGrad achieves ~2.2x faster inference and ~3.7x faster training with only 1.78M parameters, while maintaining MOS and spectral metrics close to state-of-the-art baselines. Ablation studies confirm that wavelet-based denoising, Freq-DConv, priors, and loss terms each contribute to improved audio quality, supporting practical deployment on edge devices. Overall, FreGrad demonstrates that careful architectural design and loss engineering can yield high-quality diffusion vocoding at a fraction of the computational cost.
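The multi-resolution STFT magnitude loss mentioned above can be illustrated with a small NumPy sketch. It compares the magnitude spectrograms of a target noise and a predicted noise at several STFT resolutions and averages the L1 differences. The function names, the Hann window, and the three (FFT size, hop) pairs are assumptions chosen for illustration, not values taken from the paper.

```python
import numpy as np

def stft_magnitude(x, n_fft, hop):
    """Magnitude STFT of a 1-D signal using a Hann window (no padding)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def multi_resolution_magnitude_loss(
    eps, eps_hat,
    resolutions=((512, 128), (1024, 256), (2048, 512)),  # assumed settings
):
    """Mean L1 distance between magnitude spectrograms, averaged over resolutions."""
    losses = []
    for n_fft, hop in resolutions:
        diff = stft_magnitude(eps, n_fft, hop) - stft_magnitude(eps_hat, n_fft, hop)
        losses.append(np.mean(np.abs(diff)))
    return float(np.mean(losses))
```

Because only magnitudes are compared, the term penalizes spectral mismatch while ignoring phase, which is why it would be combined with a sample-domain diffusion loss rather than used alone.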

Abstract

The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad. Our framework consists of the following three key components: (1) We employ a discrete wavelet transform that decomposes a complicated waveform into sub-band wavelets, which helps FreGrad operate on a simple and concise feature space, (2) We design a frequency-aware dilated convolution that elevates frequency awareness, resulting in speech with accurate frequency information, and (3) We introduce a bag of tricks that boosts the generation quality of the proposed model. In our experiments, FreGrad achieves 3.7 times faster training and 2.2 times faster inference than our baseline while reducing the model size by 0.6 times (only 1.78M parameters) without sacrificing output quality. Audio samples are available at: https://mm.kaist.ac.kr/projects/FreGrad.
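The sub-band decomposition in component (1) can be sketched as follows. This is a minimal example assuming a one-level Haar wavelet, which the abstract does not specify; the function names `haar_dwt` and `haar_idwt` are hypothetical. It shows the two properties that make the wavelet domain attractive here: each sub-band is half the length of the waveform, and the waveform is recovered exactly from the two sub-bands.

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar DWT: split a waveform into low and high sub-bands."""
    x = np.asarray(x, dtype=np.float64)
    if x.shape[-1] % 2:
        raise ValueError("waveform length must be even")
    even, odd = x[..., 0::2], x[..., 1::2]
    low = (even + odd) / np.sqrt(2.0)   # approximation (low-frequency) band
    high = (even - odd) / np.sqrt(2.0)  # detail (high-frequency) band
    return low, high

def haar_idwt(low, high):
    """Inverse of haar_dwt: interleave the reconstructed even and odd samples."""
    even = (low + high) / np.sqrt(2.0)
    odd = (low - high) / np.sqrt(2.0)
    out = np.empty(low.shape[:-1] + (2 * low.shape[-1],), dtype=np.float64)
    out[..., 0::2] = even
    out[..., 1::2] = odd
    return out
```

A denoiser operating on the pair of sub-bands processes sequences half as long as the raw waveform, which is one plausible source of the reported speedups.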

Paper Structure

This paper contains 11 sections, 10 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: FreGrad successfully reduces both real-time factor and the number of parameters while maintaining the synthetic quality.
  • Figure 2: Training procedure and model architecture of FreGrad. We compute wavelet features $\{\boldsymbol{x}^l, \boldsymbol{x}^h\}$ and prior distributions $\{\boldsymbol{\sigma}^l, \boldsymbol{\sigma}^h\}$ from waveform $\boldsymbol{x}$ and mel-spectrogram $\boldsymbol{X}$, respectively. At timestep $t$, noises $\{\boldsymbol{\epsilon}^l, \boldsymbol{\epsilon}^h\}$ are added to each wavelet feature. Given mel-spectrogram and timestep embedding, FreGrad approximates the noises $\{\boldsymbol{\hat{\epsilon}}^l, \boldsymbol{\hat{\epsilon}}^h\}$. The training objective is a weighted sum of $\mathcal{L}_{diff}$ and $\mathcal{L}_{mag}$ between ground truth and the predicted noise.
  • Figure 3: Frequency-aware dilated convolution.
  • Figure 4: Noise level and log SNR through timesteps. "Baselines" refers to DiffWave (ICLR 2021), WaveGrad (ICLR 2021), and PriorGrad (ICLR 2022), which use the same linear beta schedule $\boldsymbol{\beta}$ ranging from $0.0001$ to $0.05$ for 50 diffusion steps.
  • Figure 5: Spectrogram analysis on FreGrad and PriorGrad. While PriorGrad suffers from over-smoothed results, FreGrad reproduces detailed spectral correlation, especially in the regions marked by red boxes.
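
The baseline schedule described in the Figure 4 caption can be reproduced numerically. The sketch below builds the linear beta schedule from 0.0001 to 0.05 over 50 steps, then computes the cumulative signal fraction, a noise level, and the log SNR. Several parts are assumptions: the noise level is taken to be $\sqrt{\bar{\alpha}_t}$, all function names are hypothetical, and `rescale_to_zero_terminal_snr` is one known shift-and-rescale way of forcing the final SNR to zero, which may differ from the schedule the authors actually use.

```python
import numpy as np

def linear_beta_schedule(beta_start=1e-4, beta_end=0.05, steps=50):
    """Linear beta schedule with the range and step count from the caption."""
    return np.linspace(beta_start, beta_end, steps)

def log_snr(alpha_bar):
    """log(alpha_bar / (1 - alpha_bar)); returns -inf where alpha_bar == 0."""
    with np.errstate(divide="ignore"):
        return np.log(alpha_bar) - np.log1p(-alpha_bar)

def rescale_to_zero_terminal_snr(alpha_bar):
    """Assumed approach: shift and rescale sqrt(alpha_bar) so the last value is 0
    while the first value is unchanged."""
    s = np.sqrt(alpha_bar)
    s0, s_last = s[0], s[-1]
    s = (s - s_last) * s0 / (s0 - s_last)
    return s ** 2

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)      # remaining signal fraction per step
noise_level = np.sqrt(alpha_bar)         # assumed definition of "noise level"
baseline_log_snr = log_snr(alpha_bar)

zero_snr_alpha_bar = rescale_to_zero_terminal_snr(alpha_bar)
zero_snr_log_snr = log_snr(zero_snr_alpha_bar)
```

Under these assumptions the baseline schedule leaves `alpha_bar[-1]` well above zero, so the final training timestep still contains signal, whereas the rescaled schedule ends at exactly zero SNR and therefore matches the pure-noise starting point used at inference.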