Training Generative Adversarial Network-Based Vocoder with Limited Data Using Augmentation-Conditional Discriminator
Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka
TL;DR
This paper tackles data scarcity in GAN-based neural vocoders by introducing an augmentation-conditional discriminator (AugCondD), which conditions the discriminator on the augmentation state so that augmented samples do not distort the learning of the original speech distribution. AugCondD is integrated into a GAN-based vocoder framework: the augmentation state $\mu$, derived from mixup-based data augmentation, is concatenated to the discriminator input so that the model learns both the augmented and non-augmented distributions. Empirical results on LJSpeech show that AugCondD substantially improves speech quality under limited data while remaining competitive with strong baselines under full data. Its general utility is demonstrated across different network architectures, augmentation methods (e.g., mixup and speaking-rate changes), and speakers. The approach is simple to implement and can be extended to end-to-end models and fine-tuning, offering a practical route to high-quality vocoding with limited data and reduced data-collection costs.
Abstract
A generative adversarial network (GAN)-based vocoder trained with an adversarial discriminator is commonly used for speech synthesis because it is fast, lightweight, and high in quality. However, this data-driven model requires a large amount of training data, which incurs high data-collection costs. This motivates us to train a GAN-based vocoder on limited data. A promising solution is to augment the training data to avoid overfitting. However, a standard discriminator is unconditional and insensitive to distributional changes caused by data augmentation. Thus, augmented speech (which can be extraordinary) may be considered real speech. To address this issue, we propose an augmentation-conditional discriminator (AugCondD) that receives the augmentation state as input in addition to speech, thereby assessing the input speech according to the augmentation state without inhibiting the learning of the original non-augmented distribution. Experimental results indicate that AugCondD improves speech quality under limited data conditions while achieving comparable speech quality under sufficient data conditions. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/augcondd/.
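To make the conditioning idea concrete, the following is a minimal sketch of how a discriminator input could carry the augmentation state alongside the speech. It is an illustration under stated assumptions, not the authors' implementation: the text above does not specify the form of the augmentation state, so the sketch assumes it is a scalar mixup ratio `mu` broadcast as an extra input channel, and the helper names `mixup` and `condition_on_state` are hypothetical.

```python
import numpy as np


def mixup(x_a, x_b, mu):
    """Blend two equal-length waveforms with mixing ratio mu in [0, 1]."""
    return mu * x_a + (1.0 - mu) * x_b


def condition_on_state(x, mu):
    """Stack the waveform with a constant channel holding mu.

    Assumption: the augmentation state is a scalar that is broadcast
    over time and concatenated channel-wise, giving shape (2, T).
    """
    state = np.full_like(x, mu)
    return np.stack([x, state], axis=0)


# Toy waveforms standing in for two training utterances.
rng = np.random.default_rng(0)
x_a = rng.standard_normal(8).astype(np.float32)
x_b = rng.standard_normal(8).astype(np.float32)

mu = 0.7
x_aug = mixup(x_a, x_b, mu)

# Augmented speech is presented together with its state ...
d_in_aug = condition_on_state(x_aug, mu)
# ... while non-augmented speech is presented with mu = 1 (no mixing).
d_in_orig = condition_on_state(x_a, 1.0)
```

Because the state travels with every input, a discriminator fed `d_in_aug` and `d_in_orig` can, in principle, judge each sample relative to its own augmentation state instead of treating augmented speech as ordinary real speech.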
