BemaGANv2: A Tutorial and Comparative Survey of GAN-based Vocoders for Long-Term Audio Generation

Taesoo Park, Mungwi Jeong, Mingyu Park, Narae Kim, Junyoung Kim, Mujung Kim, Jisang Yoo, Hoyun Lee, Sanghoon Kim, Soonchul Kwon

TL;DR

BemaGANv2 offers a unified GAN-based vocoder framework tailored for long-term audio generation by integrating an AMP-based generator with a dual-discriminator setup (MED and MRD). The Snake activation provides a learnable periodic bias to improve harmonic modeling, while MED enhances envelope-aware temporal fidelity and MRD ensures spectral precision across scales. Across extensive objective and subjective evaluations on LJSpeech and Freesound, BemaGANv2 consistently outperforms HiFi-GAN and BigVGAN, particularly in long-form audio, and reveals critical insights into discriminator configurations and activation-function effects on stability. The work contributes a clear tutorial-style survey, a reproducible implementation guide, and practical findings that inform design choices for TTM/TTA pipelines and future multimodal audio generation systems.

Abstract

This paper presents a tutorial-style survey and implementation guide of BemaGANv2, an advanced GAN-based vocoder designed for high-fidelity and long-term audio generation. Long-term audio generation is critical for applications in Text-to-Music (TTM) and Text-to-Audio (TTA) systems, where maintaining temporal coherence, prosodic consistency, and harmonic structure over extended durations remains a significant challenge. Built upon the original BemaGAN architecture, BemaGANv2 replaces the traditional ResBlocks in the generator with the Anti-aliased Multi-Periodicity composition (AMP) module, which internally applies the Snake activation function to better model periodic structures. In the discriminator framework, we integrate the Multi-Envelope Discriminator (MED), a novel architecture we proposed, to extract rich temporal envelope features crucial for periodicity detection. Coupled with the Multi-Resolution Discriminator (MRD), this combination enables more accurate modeling of long-range dependencies in audio. We systematically evaluate various discriminator configurations, including Multi-Scale Discriminator (MSD) + MED, MSD + MRD, and Multi-Period Discriminator (MPD) + MED + MRD, using objective metrics (Fréchet Audio Distance (FAD), Structural Similarity Index (SSIM), Pearson Correlation Coefficient (PCC), and Mel-Cepstral Distortion (MCD)) and subjective evaluations (MOS, SMOS). This paper also provides a comprehensive tutorial on the model architecture, training methodology, and implementation to promote reproducibility. The code and pre-trained models are available at: https://github.com/dinhoitt/BemaGANv2.
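
For readers unfamiliar with the Snake activation mentioned in the abstract, the following is a minimal sketch of its commonly cited form, snake(x) = x + sin^2(alpha * x) / alpha, in plain Python. The function names and the scalar `alpha` are illustrative assumptions. In BigVGAN-style generators alpha is typically a learnable per-channel parameter, and this sketch is not taken from the BemaGANv2 code base.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation in its commonly cited form: x + sin^2(alpha * x) / alpha."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return x + math.sin(alpha * x) ** 2 / alpha


def snake_all(xs, alpha=1.0):
    """Apply snake element-wise to a sequence of floats (illustrative helper)."""
    return [snake(x, alpha) for x in xs]
```

The periodic term adds ripples around the identity line whose frequency grows with alpha. Because the derivative, 1 + sin(2 * alpha * x), is never negative, the function stays monotonic.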

Paper Structure

This paper contains 42 sections, 10 equations, 8 figures, 5 tables, 2 algorithms.

Figures (8)

  • Figure 1: Comparison of activation functions (each in separate axis).
  • Figure 2: The architecture of the AMP-based generator used in BemaGANv2. The AMP block, originally introduced in BigVGAN, integrates upsampling and downsampling operations with low-pass filtering (LPF), Snake activation for periodic inductive bias, and dilated convolutions.
  • Figure 3: The structure of the Multi-Envelope Discriminator (MED). Time-domain envelopes, including both upper and lower envelopes, are extracted from the input audio using low-pass filters with different cutoff frequencies. These envelope signals are then processed by 1D convolutional layers. This design enables the discriminator to detect temporal energy patterns that are crucial for perceptual quality, such as prosodic variation.
  • Figure 4: Overview of the BemaGANv2 architecture. The generator converts Mel-spectrograms into raw waveforms, which are evaluated by two discriminators: the Multi-Envelope Discriminator (MED) and the Multi-Resolution Discriminator (MRD). Adversarial and auxiliary losses are backpropagated from both discriminators to train the generator.
  • Figure 5: Mel-Spectrogram visualization of samples from Ground Truth, BigVGAN, and BemaGANv2 trained on LJSpeech.
  • ...and 3 more figures
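
The Figure 3 caption describes extracting upper and lower time-domain envelopes with low-pass filters at several cutoff frequencies. A minimal sketch of that idea follows, in plain Python. The extraction scheme here (half-wave rectification followed by a one-pole low-pass filter) and all function names are assumptions made for illustration; the caption does not specify the filter design, and the authors' MED may compute envelopes differently.

```python
import math


def one_pole_lowpass(signal, cutoff_hz, sample_rate):
    """Smooth a sequence with a first-order (one-pole) low-pass filter."""
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, y = [], 0.0
    for x in signal:
        y += a * (x - y)  # move the state a fraction `a` toward the input
        out.append(y)
    return out


def envelopes(signal, cutoff_hz, sample_rate):
    """Return (upper, lower): rectify each half, then low-pass (assumed scheme)."""
    upper = one_pole_lowpass([max(x, 0.0) for x in signal], cutoff_hz, sample_rate)
    lower = one_pole_lowpass([min(x, 0.0) for x in signal], cutoff_hz, sample_rate)
    return upper, lower


def multi_envelopes(signal, cutoffs_hz, sample_rate):
    """Build one (upper, lower) envelope pair per cutoff frequency."""
    return {fc: envelopes(signal, fc, sample_rate) for fc in cutoffs_hz}


# Example: a 440 Hz tone at 16 kHz, with two hypothetical cutoff frequencies.
sr = 16000
tone = [math.sin(2.0 * math.pi * 440.0 * n / sr) for n in range(1600)]
bank = multi_envelopes(tone, [50.0, 500.0], sr)
```

In this sketch a lower cutoff yields a smoother, slower-moving envelope, while a higher cutoff follows the waveform more closely. Each pair could then be passed to its own stack of 1D convolutional layers, as the caption describes.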