BemaGANv2: A Tutorial and Comparative Survey of GAN-based Vocoders for Long-Term Audio Generation

Taesoo Park; Mungwi Jeong; Mingyu Park; Narae Kim; Junyoung Kim; Mujung Kim; Jisang Yoo; Hoyun Lee; Sanghoon Kim; Soonchul Kwon

TL;DR

BemaGANv2 offers a unified GAN-based vocoder framework tailored for long-term audio generation by integrating an AMP-based generator with a dual-discriminator setup (MED and MRD). The Snake activation provides a learnable periodic bias to improve harmonic modeling, while MED enhances envelope-aware temporal fidelity and MRD ensures spectral precision across scales. Across extensive objective and subjective evaluations on LJSpeech and Freesound, BemaGANv2 consistently outperforms HiFi-GAN and BigVGAN, particularly in long-form audio, and reveals critical insights into discriminator configurations and activation-function effects on stability. The work contributes a clear tutorial-style survey, a reproducible implementation guide, and practical findings that inform design choices for TTM/TTA pipelines and future multimodal audio generation systems.

Abstract

This paper presents a tutorial-style survey and implementation guide of BemaGANv2, an advanced GAN-based vocoder designed for high-fidelity and long-term audio generation. Long-term audio generation is critical for applications in Text-to-Music (TTM) and Text-to-Audio (TTA) systems, where maintaining temporal coherence, prosodic consistency, and harmonic structure over extended durations remains a significant challenge. Built upon the original BemaGAN architecture, BemaGANv2 replaces the traditional ResBlocks in the generator with the Anti-aliased Multi-Periodicity composition (AMP) module, which internally applies the Snake activation function to better model periodic structures. In the discriminator framework, we integrate the Multi-Envelope Discriminator (MED), a novel architecture that we propose, to extract rich temporal envelope features crucial for periodicity detection. Coupled with the Multi-Resolution Discriminator (MRD), this combination enables more accurate modeling of long-range dependencies in audio. We systematically evaluate various discriminator configurations, including Multi-Scale Discriminator (MSD) + MED, MSD + MRD, and Multi-Period Discriminator (MPD) + MED + MRD, using objective metrics (Fréchet Audio Distance (FAD), Structural Similarity Index (SSIM), Pearson Correlation Coefficient (PCC), Mel-Cepstral Distortion (MCD)) and subjective evaluations (MOS, SMOS). This paper also provides a comprehensive tutorial on the model architecture, training methodology, and implementation to promote reproducibility. The code and pre-trained models are available at: https://github.com/dinhoitt/BemaGANv2.
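
To make the periodic inductive bias mentioned above concrete, the following is a minimal sketch of the Snake activation in its commonly cited form, snake(x) = x + (1/alpha) * sin^2(alpha * x). It is an illustration only, not the repository's implementation: the function names `snake` and `snake_signal` are hypothetical, and `alpha` is treated here as a fixed scalar, whereas in a trained generator it would typically be a learnable parameter.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation: x + (1/alpha) * sin^2(alpha * x).

    `alpha` sets the frequency of the periodic term. It is a fixed
    scalar in this sketch (an assumption made for illustration).
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return x + math.sin(alpha * x) ** 2 / alpha


def snake_signal(samples, alpha: float = 1.0):
    """Apply the Snake activation element-wise to a sequence of samples."""
    return [snake(s, alpha) for s in samples]


if __name__ == "__main__":
    # The periodic term repeats every pi / alpha, so shifting the input by
    # one period shifts the output by exactly that amount.
    alpha = 2.0
    period = math.pi / alpha
    for x in (-1.0, 0.0, 0.3, 1.7):
        print(x, snake(x, alpha), snake(x + period, alpha) - snake(x, alpha))
```

Because the sin^2 term is non-negative and periodic, the output is the identity plus a bounded ripple, which preserves monotonic growth while giving the network a built-in preference for periodic patterns.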
