Energy-Based Models with Applications to Speech and Language Processing

Zhijian Ou

TL;DR

This monograph surveys energy-based models (EBMs) with a focus on their application to speech and language processing. It covers the fundamentals of EBMs, including their formulation as undirected graphical models, learning via maximum likelihood and alternatives such as noise-contrastive estimation (NCE), and generation via MCMC and gradient-based samplers. It then details three application scenarios: modeling marginal distributions of language data, conditional EBMs for tasks such as speech recognition and text generation, and joint EBMs over observations and labels for semi-supervised learning and calibrated natural language understanding. Across chapters, the monograph emphasizes sequential data, addresses the biases inherent in locally normalized models, and presents modern neural-network parameterizations together with advanced training strategies such as DNCE, inclusive variational methods, and joint training with auxiliary generators. It also highlights the practical challenges of generating from EBMs and surveys methods such as TRF-LMs, GN-ELMs, residual EBMs, and mix-and-match decoding, underscoring EBMs’ potential for improved calibration, sample quality, and flexible, globally normalized modeling in NLP and ASR.

Abstract

Energy-Based Models (EBMs) are an important class of probabilistic models, also known as random fields and undirected graphical models. EBMs are un-normalized and thus radically different from other popular self-normalized probabilistic models such as hidden Markov models (HMMs), autoregressive models, generative adversarial nets (GANs) and variational auto-encoders (VAEs). In recent years, EBMs have attracted increasing interest not only from the core machine learning community but also from application domains such as speech, vision and natural language processing (NLP), owing to significant theoretical and algorithmic progress. The sequential nature of speech and language also presents special challenges and requires treatment different from that of fixed-dimensional data (e.g., images). The purpose of this monograph is therefore to present a systematic introduction to energy-based models, covering both algorithmic progress and applications in speech and language processing. First, the basics of EBMs are introduced, including classic models, recent models parameterized by neural networks, sampling methods, and learning methods ranging from the classic algorithms to the most advanced ones. Then, the application of EBMs is presented in three scenarios, namely modeling marginal, conditional and joint distributions: 1) EBMs for sequential data with applications in language modeling, where the main focus is on the marginal distribution of a sequence itself; 2) EBMs for modeling conditional distributions of target sequences given observation sequences, with applications in speech recognition, sequence labeling and text generation; 3) EBMs for modeling joint distributions of both sequences of observations and targets, with applications in semi-supervised learning and calibrated natural language understanding.
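
To make "un-normalized" concrete, here is a minimal sketch of a toy EBM over short symbol sequences. The vocabulary, the `energy` function, and the helper name `ebm_distribution` are hypothetical choices for illustration and are not taken from the monograph. The point is that the model assigns a scalar energy to a whole sequence, and a probability only arises after dividing by a single global normalizer Z that sums over every possible sequence.

```python
import itertools
import math

# Hypothetical three-symbol vocabulary, for illustration only.
VOCAB = ("a", "b", "c")


def energy(seq):
    """Hypothetical energy: +1 for every pair of adjacent repeated symbols."""
    return sum(1.0 for u, v in zip(seq, seq[1:]) if u == v)


def ebm_distribution(length):
    """Return p(x) = exp(-E(x)) / Z for all sequences of a fixed length.

    Z is computed by brute-force enumeration, which is only feasible for
    toy problems: the number of terms grows as len(VOCAB) ** length.
    """
    seqs = list(itertools.product(VOCAB, repeat=length))
    unnorm = [math.exp(-energy(s)) for s in seqs]  # un-normalized scores
    Z = sum(unnorm)  # global normalizer over whole sequences
    return {s: u / Z for s, u in zip(seqs, unnorm)}, Z


probs, Z = ebm_distribution(3)
print(len(probs), Z)
print(probs[("a", "b", "a")], probs[("a", "a", "a")])
```

Lower-energy sequences (here, those without adjacent repeats) receive higher probability. For realistic vocabularies and lengths the sum defining Z cannot be enumerated, which is why sampling and specialized learning methods are needed.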

Paper Structure (168 sections, 11 theorems, 196 equations, 37 figures, 9 tables, 9 algorithms)

Introduction
The probabilistic approach
Generative models and discriminative models
Conditional models
Features of EBMs
Organization of this monograph
Basics for EBMs
Probabilistic graphical models (PGMs)
Directed graphical models
Factorization and Markov properties in directed graphical models
DGM example - HMM
DGM example - Neural network based classifier
Undirected graphical models
Factorization and Markov properties in undirected graphical models
Energy-based models and Gibbs distributions
...and 153 more sections

Key Result

Theorem 2.1

Denote the target density as $p(z;\lambda)$ with given $\lambda$. Assume that one can compute a noisy, unbiased estimate $\Delta(z;\lambda)$ (a stochastic gradient) of the gradient $\frac{\partial}{\partial z} \log p(z;\lambda)$. For a sequence of asymptotically vanishing time-steps [...] The iterations of Eq. (eq:SGLD) lead to the target distribution $p(z;\lambda)$ as the stationary distribution.
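
The theorem concerns stochastic gradient Langevin dynamics (SGLD). The update equation itself is cut off in the statement above, so the following is only a sketch under stated assumptions: it uses the common Langevin form z ← z + δ_l Δ(z;λ) + sqrt(2 δ_l) η with standard Gaussian noise η, and a polynomially decaying step size δ_l = a (b + l)^(-γ) with γ in (0.5, 1], a standard choice that makes the steps vanish while their sum diverges. The function names `sgld_sample` and `noisy_grad_gaussian`, the schedule constants, and the Gaussian target are all hypothetical.

```python
import numpy as np


def noisy_grad_gaussian(z, rng, mu=2.0, sigma=1.0, noise_std=1.0):
    """Noisy, unbiased estimate of d/dz log N(z; mu, sigma^2) (toy target)."""
    exact = -(z - mu) / sigma**2
    return exact + noise_std * rng.standard_normal(np.shape(z))


def sgld_sample(grad_log_p, z0, n_steps=50000, a=0.5, b=10.0, gamma=0.51, seed=0):
    """Run SGLD with vanishing step sizes and return all iterates.

    Assumed update: z <- z + delta * grad + sqrt(2 * delta) * eta,
    where eta is standard Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float)
    samples = np.empty((n_steps,) + z.shape)
    for l in range(1, n_steps + 1):
        delta = a * (b + l) ** (-gamma)  # asymptotically vanishing step size
        eta = rng.standard_normal(z.shape)
        z = z + delta * grad_log_p(z, rng) + np.sqrt(2.0 * delta) * eta
        samples[l - 1] = z
    return samples


samples = sgld_sample(noisy_grad_gaussian, z0=0.0)
kept = samples[5000:]  # discard early iterates taken with larger steps
print(kept.mean(), kept.std())
```

With the toy target N(2, 1), the retained iterates should have a mean near 2 and a standard deviation near 1. Only a gradient of the log-density is required, so the normalizing constant of the target never needs to be computed, which is what makes this sampler attractive for EBMs.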

Figures (37)

Figure 1: The probabilistic approach
Figure 2: Outline of this monograph
Figure 3: (a) A simple directed graphical model with four variables $(x_1, x_2, x_3, x_4)$. (b) A simple undirected graphical model with four variables $(x_1, x_2, x_3, x_4)$. For both types of graphs, $V$ denotes the set of nodes and $E$ the set of edges. If both ordered pairs $(\alpha, \beta)$ and $(\beta, \alpha)$ belong to $E$, we say that there is an undirected edge between $\alpha$ and $\beta$. A good introduction to graph theory in the context of graphical models can be found in Chapter 4 of Cowell et al. (1999).
Figure 4: Graphical model representation of a hidden Markov model (HMM).
Figure 5: Neural network based classifier. (a) GM representation; (b) Computational graph representation.
...and 32 more figures

Theorems & Definitions (22)

Definition 2.1: DGM
Definition 2.2: UGM
Definition 2.3: EBM
Definition 2.4: Log-linear model
Example 2.1: Word morphology
Definition 2.5: EBMs parameterized by neural networks
Theorem 2.1
Theorem 2.2
Theorem 2.3
proof
...and 12 more
