BemaGANv2: A Tutorial and Comparative Survey of GAN-based Vocoders for Long-Term Audio Generation
Taesoo Park, Mungwi Jeong, Mingyu Park, Narae Kim, Junyoung Kim, Mujung Kim, Jisang Yoo, Hoyun Lee, Sanghoon Kim, Soonchul Kwon
TL;DR
BemaGANv2 offers a unified GAN-based vocoder framework tailored for long-term audio generation by integrating an AMP-based generator with a dual-discriminator setup (MED and MRD). The Snake activation provides a learnable periodic bias to improve harmonic modeling, while MED enhances envelope-aware temporal fidelity and MRD ensures spectral precision across scales. Across extensive objective and subjective evaluations on LJSpeech and Freesound, BemaGANv2 consistently outperforms HiFi-GAN and BigVGAN, particularly in long-form audio, and reveals critical insights into discriminator configurations and activation-function effects on stability. The work contributes a clear tutorial-style survey, a reproducible implementation guide, and practical findings that inform design choices for TTM/TTA pipelines and future multimodal audio generation systems.
Abstract
This paper presents a tutorial-style survey and implementation guide for BemaGANv2, an advanced GAN-based vocoder designed for high-fidelity and long-term audio generation. Long-term audio generation is critical for applications in Text-to-Music (TTM) and Text-to-Audio (TTA) systems, where maintaining temporal coherence, prosodic consistency, and harmonic structure over extended durations remains a significant challenge. Built upon the original BemaGAN architecture, BemaGANv2 incorporates major architectural innovations by replacing traditional ResBlocks in the generator with the Anti-aliased Multi-Periodicity composition (AMP) module, which internally applies the Snake activation function to better model periodic structures. In the discriminator framework, we integrate the Multi-Envelope Discriminator (MED), a novel architecture that we previously proposed, to extract rich temporal envelope features crucial for periodicity detection. Coupled with the Multi-Resolution Discriminator (MRD), this combination enables more accurate modeling of long-range dependencies in audio. We systematically evaluate various discriminator configurations, including Multi-Scale Discriminator (MSD) + MED, MSD + MRD, and Multi-Period Discriminator (MPD) + MED + MRD, using objective metrics (Fréchet Audio Distance (FAD), Structural Similarity Index (SSIM), Pearson Correlation Coefficient (PCC), Mel-Cepstral Distortion (MCD)) and subjective evaluations (MOS, SMOS). This paper also provides a comprehensive tutorial on the model architecture, training methodology, and implementation to promote reproducibility. The code and pre-trained models are available at: https://github.com/dinhoitt/BemaGANv2.
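To make the role of the Snake activation concrete, the following is a minimal sketch of its commonly cited form, snake(x) = x + sin^2(alpha * x) / alpha. It is an illustration under stated assumptions, not the BemaGANv2 implementation: the function names `snake` and `periodic_part` are hypothetical, and `alpha` is treated here as a fixed scalar, whereas in a trained generator it would typically be a learnable parameter.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation in its commonly cited form: x + sin^2(alpha * x) / alpha.

    `alpha` sets the frequency of the periodic term. It is a fixed scalar
    here for illustration only (an assumption of this sketch).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return x + math.sin(alpha * x) ** 2 / alpha


def periodic_part(x: float, alpha: float = 1.0) -> float:
    """Return only the periodic term that Snake adds to the identity."""
    return snake(x, alpha) - x


if __name__ == "__main__":
    # The periodic term repeats every pi / alpha, while the identity term
    # keeps the function monotonically non-decreasing overall.
    alpha = 2.0
    period = math.pi / alpha
    for x in (0.0, 0.3, 0.3 + period):
        print(f"x={x:.4f}  snake={snake(x, alpha):.4f}  "
              f"periodic={periodic_part(x, alpha):.4f}")
```

The identity term lets the activation pass signals through much like a linear unit, and the sin^2 term adds a periodic component whose frequency is controlled by `alpha`. This is the sense in which Snake supplies a periodic bias for modeling harmonic structure.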
