Score-Based Training for Energy-Based TTS Models
Wanli Sun, Anton Ragni
TL;DR
The paper tackles the challenge of training energy-based models for text-to-speech when the normalisation term is intractable. It introduces score-based training approaches, notably sliced score matching (SSM) and a novel delta loss, to produce scores that are amenable to fast, first-order inference. Empirical results on LJSpeech show that these score-based methods outperform noise-contrastive estimation (NCE) across objective metrics such as MCD, log F0, and model-based perceptual scores, with delta loss offering strong subjective naturalness and computational efficiency. The work also links delta loss to flow matching, highlighting theoretical connections and practical implications for diffusion-style versus first-order inference in EBMs. Overall, score-based training enables more reliable and efficient energy-based TTS with competitive perceptual quality.
Abstract
Noise contrastive estimation (NCE) is a popular method for training energy-based models (EBMs) with intractable normalisation terms. The key idea of NCE is to learn by comparing the unnormalised log-likelihoods of reference and noisy samples, thus avoiding explicit computation of the normalisation terms. However, NCE relies critically on the quality of the noisy samples. Recently, sliced score matching (SSM) has been popularised by the closely related diffusion models (DMs). Unlike NCE, SSM learns the gradient of the log-likelihood, or score, by learning the distribution of its projections onto randomly chosen directions. However, both NCE and SSM disregard the form of the log-likelihood function, which is problematic given that EBMs and DMs use first-order optimisation during inference. This paper proposes a new criterion that learns scores more suitable for first-order schemes. Experiments contrast these approaches for training EBMs.
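
To make the idea of learning a score through its random projections concrete, the following is a minimal sketch of the standard sliced score matching objective, v^T (d s / d x) v + 0.5 (v^T s)^2 averaged over samples x and random directions v. It is not the paper's TTS model or its proposed criterion: the isotropic Gaussian energy, the function names `score` and `ssm_loss`, and the central finite difference (used here in place of automatic differentiation) are all assumptions chosen so the example runs with the Python standard library only.

```python
import random

def score(x, mu, sigma=1.0):
    """Score (-grad_x E) of a hypothetical toy energy E(x) = ||x - mu||^2 / (2 sigma^2)."""
    return [-(xi - mi) / sigma**2 for xi, mi in zip(x, mu)]

def ssm_loss(samples, mu, sigma=1.0, eps=1e-4, seed=0):
    """Monte Carlo estimate of the sliced score matching loss.

    One random +/-1 direction v is drawn per sample. The term
    v^T (d s / d x) v is approximated by a central finite difference
    of the projected score along v (an assumption made to avoid autodiff).
    """
    rng = random.Random(seed)
    total = 0.0
    for x in samples:
        v = [rng.choice((-1.0, 1.0)) for _ in x]

        def proj(point):
            # Projection of the model score onto the direction v.
            return sum(vi * si for vi, si in zip(v, score(point, mu, sigma)))

        x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
        x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
        directional = (proj(x_plus) - proj(x_minus)) / (2 * eps)
        total += directional + 0.5 * proj(x) ** 2
    return total / len(samples)

# Toy data: 2-D Gaussian with mean (1, -1) and unit variance.
data_rng = random.Random(1)
data = [[data_rng.gauss(1.0, 1.0), data_rng.gauss(-1.0, 1.0)] for _ in range(2000)]

loss_true = ssm_loss(data, mu=[1.0, -1.0])   # model mean matches the data
loss_wrong = ssm_loss(data, mu=[3.0, 1.0])   # model mean is shifted
```

In this toy setting the loss is lower when the model mean matches the data mean, and no normalisation term is ever evaluated, which is the property that makes score-based criteria attractive for EBMs.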
