BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

Takashi Shibuya; Yuhta Takida; Yuki Mitsufuji

TL;DR

This paper proposes a scheme to modify least-squares GAN, which most GAN-based vocoders adopt, so that their loss functions satisfy the requirements of SAN, and demonstrates that SAN can improve the performance of GANs, including BigVGAN, with small modifications.

Abstract

Generative adversarial network (GAN)-based vocoders have been intensively studied because they can synthesize high-fidelity audio waveforms faster than real time. However, it has been reported that most GANs fail to obtain the optimal projection for discriminating between real and fake data in the feature space. Prior work has demonstrated that slicing adversarial network (SAN), an improved GAN training framework that can find the optimal projection, is effective in the image generation task. In this paper, we investigate the effectiveness of SAN in the vocoding task. For this purpose, we propose a scheme to modify least-squares GAN, which most GAN-based vocoders adopt, so that their loss functions satisfy the requirements of SAN. Through our experiments, we demonstrate that SAN can improve the performance of GAN-based vocoders, including BigVGAN, with small modifications. Our code is available at https://github.com/sony/bigvsan.
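To make the notion of a "projection in the feature space" concrete, the following is a minimal sketch, not the SAN training procedure. It assumes the discriminator output can be read as the inner product of a feature vector with a unit-norm direction, and it uses made-up 2-D features. The names `normalize`, `project`, and `mean_gap` are hypothetical helpers introduced only for this illustration.

```python
import math


def normalize(direction):
    """Scale a direction vector to unit length."""
    norm = math.sqrt(sum(w * w for w in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [w / norm for w in direction]


def project(features, direction):
    """Project a feature vector onto the unit-normalized direction."""
    unit = normalize(direction)
    return sum(u * h for u, h in zip(unit, features))


# Hypothetical 2-D features (illustrative values, not from the paper).
real_features = [[2.0, 0.1], [1.8, -0.2]]
fake_features = [[-2.0, 0.0], [-1.9, 0.3]]


def mean_gap(direction):
    """Difference between the mean projected real and fake features."""
    real = sum(project(h, direction) for h in real_features) / len(real_features)
    fake = sum(project(h, direction) for h in fake_features) / len(fake_features)
    return real - fake


gap_good = mean_gap([1.0, 0.0])  # direction aligned with the real/fake split
gap_poor = mean_gap([0.0, 1.0])  # direction orthogonal to the split
print(gap_good, gap_poor)
```

In this toy setting, the same features are separated well or poorly depending only on the chosen direction, which is the sense in which a projection can be more or less suited to discrimination.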

Paper Structure (11 sections, 5 equations, 2 figures, 2 tables)

Introduction
Related Work
Method
Overall Framework
SANs for Vocoder Training
From GANs to SANs
Soft Monotonization for Least-Squares SAN
Experiments
BigVSAN: Large-scale Vocoder Training
Moderate-sized Vocoder Training
Conclusion

Figures (2)

Figure 1: Comparisons of $R_i:\mathbb{R}\to\mathbb{R}$ $(i=1,2,3)$. In (c), $R_3$ of least-squares GAN is increasing in the red shaded region, which is problematic for SAN because the function is non-monotonic. In contrast, $R_3$ of least-squares SAN is monotonically decreasing over the entire real line but keeps the shape of least-squares GAN to some extent in the blue shaded region.
Figure 2: Spectrograms of synthesized samples with BigVSAN trained on the LibriTTS train set for 1M steps and the corresponding ground truth.
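
The non-monotonicity described in the Figure 1 caption can be checked numerically. The sketch below assumes that $R_3$ of least-squares GAN corresponds to the standard least-squares generator loss $(z-1)^2$, where $z$ is the discriminator output; under that assumption the increase begins at $z=1$. The caption does not state the boundaries of the shaded regions, and the soft monotonization used for least-squares SAN is not reproduced here. The helper names are hypothetical.

```python
def ls_gan_generator_loss(z: float) -> float:
    """Standard least-squares GAN generator loss for a discriminator output z
    (assumed here to play the role of R_3 in the caption)."""
    return (z - 1.0) ** 2


def is_monotonically_decreasing(fn, points) -> bool:
    """Return True if fn never increases along the sorted sample points."""
    values = [fn(p) for p in points]
    return all(a >= b for a, b in zip(values, values[1:]))


grid_below_one = [i / 10 for i in range(-30, 11)]  # z in [-3, 1]
grid_full = [i / 10 for i in range(-30, 31)]       # z in [-3, 3]

decreasing_below_one = is_monotonically_decreasing(ls_gan_generator_loss, grid_below_one)
decreasing_everywhere = is_monotonically_decreasing(ls_gan_generator_loss, grid_full)
print(decreasing_below_one, decreasing_everywhere)
```

The loss decreases up to $z=1$ and increases afterwards, so it fails a monotonicity requirement of the kind the caption attributes to SAN.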
