Score-Based Training for Energy-Based TTS Models

Wanli Sun, Anton Ragni

TL;DR

The paper tackles the challenge of training energy-based models for text-to-speech when the normalisation term is intractable. It introduces score-based training approaches, notably sliced score matching (SSM) and a novel delta loss, to produce scores that are amenable to fast, first-order inference. Empirical results on LJSpeech show that these score-based methods outperform noise-contrastive estimation (NCE) across objective metrics such as MCD, log F0, and model-based perceptual scores, with delta loss offering strong subjective naturalness and computational efficiency. The work also links delta loss to flow matching, highlighting theoretical connections and practical implications for diffusion-style versus first-order inference in EBMs. Overall, score-based training enables more reliable and efficient energy-based TTS with competitive perceptual quality.

Abstract

Noise contrastive estimation (NCE) is a popular method for training energy-based models (EBMs) with intractable normalisation terms. The key idea of NCE is to learn by comparing the unnormalised log-likelihoods of reference and noisy samples, thus avoiding explicit computation of the normalisation terms. However, NCE critically relies on the quality of the noisy samples. Recently, sliced score matching (SSM) has been popularised by the closely related diffusion models (DMs). Unlike NCE, SSM learns the gradient of the log-likelihood, or score, by learning the distribution of its projections onto randomly chosen directions. However, both NCE and SSM disregard the form of the log-likelihood function, which is problematic given that EBMs and DMs use first-order optimisation during inference. This paper proposes a new criterion that learns scores more suitable for first-order schemes. Experiments contrast these approaches for training EBMs.
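
To make the idea of learning a score through random projections concrete, the following is a minimal sketch of the standard sliced score matching objective, E_v E_x [ v^T (ds/dx) v + 0.5 (v^T s(x))^2 ], evaluated on a toy problem. It is not the paper's implementation: the isotropic Gaussian energy stands in for the TTS energy model, the names `gaussian_score` and `ssm_loss` are hypothetical, and a central finite difference replaces the automatic differentiation a real system would presumably use.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def gaussian_score(x, mu, sigma):
    """Score of the toy energy E(x) = ||x - mu||^2 / (2 sigma^2).

    The score is the gradient of the log-likelihood with respect to x,
    here s(x) = -(x - mu) / sigma^2. The normalisation term never appears.
    """
    return [-(xi - mi) / sigma ** 2 for xi, mi in zip(x, mu)]


def ssm_loss(score_fn, samples, rng, n_proj=1, eps=1e-4):
    """Monte Carlo estimate of the sliced score matching objective.

    For each sample x and random direction v ~ N(0, I), accumulate
    v^T J v + 0.5 * (v^T s(x))^2, where J is the Jacobian of the score.
    Assumption: v^T J v is approximated by a central finite difference
    of the projected score v^T s(.) along v.
    """
    total, count = 0.0, 0
    for x in samples:
        for _ in range(n_proj):
            v = [rng.gauss(0.0, 1.0) for _ in x]
            x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
            x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
            curvature = (dot(v, score_fn(x_plus)) - dot(v, score_fn(x_minus))) / (2 * eps)
            total += curvature + 0.5 * dot(v, score_fn(x)) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    data_rng = random.Random(0)
    # Toy "reference" data drawn from a 2-D Gaussian with mean (1, -2), unit variance.
    data = [[data_rng.gauss(1.0, 1.0), data_rng.gauss(-2.0, 1.0)] for _ in range(2000)]
    for label, mu, sigma in [("true model", [1.0, -2.0], 1.0),
                             ("wrong mean", [0.0, 0.0], 1.0),
                             ("wrong scale", [1.0, -2.0], 2.0)]:
        loss = ssm_loss(lambda x: gaussian_score(x, mu, sigma), data, random.Random(1))
        print(label, round(loss, 3))
```

In this toy setting the loss should be lowest for the score that matches the data distribution, which is the property SSM relies on. The objective depends only on the score and its directional derivative, so no noisy samples are needed, in contrast to NCE.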

Paper Structure

This paper contains 16 sections, 10 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Two ways of computing scores for EBMs: analytic (top) and predictive (bottom)
  • Figure 2: Different score functions. From left to right, the hypotheses are ordered from worst to best.
  • Figure 3: Detailed breakdown of MOS score counts