FreGrad: Lightweight and Fast Frequency-aware Diffusion Vocoder
Tan Dat Nguyen, Ji-Hoon Kim, Youngjoon Jang, Jaehun Kim, Joon Son Chung
TL;DR
FreGrad tackles the computational burden of diffusion vocoders by operating in a compact, wavelet-based feature space. It introduces a frequency-aware dilated convolution (Freq-DConv) and a bag of tricks (separate priors per wavelet band, a zero-SNR noise schedule, and a multi-resolution STFT magnitude loss) to preserve spectral fidelity with few diffusion steps. Empirically, FreGrad achieves ~2.2x faster inference and ~3.7x faster training than its baseline with only 1.78M parameters, while keeping MOS and spectral metrics close to those of state-of-the-art baselines. Ablation studies indicate that wavelet-based denoising, Freq-DConv, the priors, and the loss terms each contribute to audio quality. The small footprint and fast inference make the model a practical candidate for deployment on edge devices. Overall, FreGrad demonstrates that careful architectural design and loss engineering can yield high-quality diffusion vocoding at a fraction of the computational cost.
Abstract
The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad. Our framework consists of three key components: (1) we employ a discrete wavelet transform that decomposes a complicated waveform into sub-band wavelets, which lets FreGrad operate on a simple and concise feature space; (2) we design a frequency-aware dilated convolution that elevates frequency awareness, so that the generated speech carries accurate frequency information; and (3) we introduce a bag of tricks that boosts the generation quality of the proposed model. In our experiments, FreGrad trains 3.7 times faster and runs inference 2.2 times faster than our baseline, while reducing the model size to 0.6 times that of the baseline (only 1.78M parameters), without sacrificing output quality. Audio samples are available at: https://mm.kaist.ac.kr/projects/FreGrad.
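
To make component (1) concrete, here is a minimal sketch of a single-level discrete wavelet transform and its inverse. It assumes a Haar wavelet, which the abstract does not specify, and the function names `haar_dwt` and `haar_idwt` are hypothetical. It is not FreGrad's implementation; it only illustrates how a waveform splits into two half-length sub-bands that can be recombined without loss, which is what lets a model work on a shorter, simpler signal.

```python
import math


def haar_dwt(signal):
    """Single-level Haar DWT (illustrative; the wavelet choice is an assumption).

    Splits an even-length sequence into a low-frequency and a
    high-frequency sub-band, each half the original length.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    # Pairwise scaled sums keep the coarse (low-frequency) content.
    low = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    # Pairwise scaled differences keep the detail (high-frequency) content.
    high = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return low, high


def haar_idwt(low, high):
    """Inverse of haar_dwt: interleave the reconstructed even and odd samples."""
    scale = math.sqrt(2.0)
    signal = []
    for l, h in zip(low, high):
        signal.append((l + h) / scale)  # even-indexed sample
        signal.append((l - h) / scale)  # odd-indexed sample
    return signal


# Toy waveform: a slow sine plus a weaker fast sine, 64 samples.
waveform = [
    math.sin(2 * math.pi * 5 * n / 64) + 0.25 * math.sin(2 * math.pi * 29 * n / 64)
    for n in range(64)
]
low, high = haar_dwt(waveform)
reconstructed = haar_idwt(low, high)
max_error = max(abs(a - b) for a, b in zip(waveform, reconstructed))
print(len(waveform), len(low), len(high))
print(max_error < 1e-9)
```

Each sub-band has half as many samples as the input, so a denoiser that processes the two sub-bands as channels sees a sequence half as long. The inverse transform then recovers the full-rate waveform from the generated sub-bands.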
