Training Energy-Based Models with Diffusion Contrastive Divergences

Weijian Luo; Hao Jiang; Tianyang Hu; Jiacheng Sun; Zhenguo Li; Zhihua Zhang

TL;DR

The paper introduces Diffusion Contrastive Divergence (DCD), a framework that generalizes Contrastive Divergence by replacing EBM-induced Langevin dynamics with parameter-free diffusion processes. The authors prove that DCD is a valid KL-divergence-based objective, connect it to diffusion recovery likelihood and KL-contraction, and derive practical one-step estimators like DCD-VE for training EBMs. They further propose time-dependent EBMs trained via DCD and demonstrate on synthetic 2D tasks, image denoising, and CelebA $32\times 32$ generation that DCD improves efficiency and often outperforms CD while remaining competitive with existing generative approaches. The work highlights DCD as a unifying, analysis-friendly, and scalable approach for MCMC-free EBM training, with clear avenues for future exploration in long-time diffusion dynamics and diffusion choices.

Abstract

Energy-Based Models (EBMs) have been widely used for generative modeling. Contrastive Divergence (CD), a prevailing training objective for EBMs, requires sampling from the EBM with Markov Chain Monte Carlo (MCMC) methods, which leads to an irreconcilable trade-off between computational burden and the validity of CD. Running MCMC until convergence is computationally intensive. On the other hand, short-run MCMC introduces an extra non-negligible parameter gradient term that is difficult to handle. In this paper, we provide a general interpretation of CD, viewing it as a special instance of our proposed Diffusion Contrastive Divergence (DCD) family. By replacing the Langevin dynamics used in CD with other EBM-parameter-free diffusion processes, we propose a more efficient divergence. We show that the proposed DCDs are more computationally efficient than CD and do not suffer from the non-negligible gradient term. We conduct extensive experiments, covering synthetic data modeling as well as high-dimensional image denoising and generation, to show the advantages of the proposed DCDs. In the synthetic data learning and image denoising experiments, the proposed DCD outperforms CD by a large margin. In the image generation experiments, the proposed DCD is capable of training an energy-based model that generates the CelebA $32\times 32$ dataset, with results comparable to existing EBMs.
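To make the trade-off described in the abstract concrete, here is a minimal sketch of CD training with short-run Langevin sampling on a toy problem. Everything in it is an illustrative assumption rather than the paper's setup: the one-parameter energy $E_\theta(x) = \tfrac{1}{2}(x-\theta)^2$, the step sizes, and the helper names (`langevin`, `cd_gradient`, `train_cd`) are all hypothetical. The sampler output is treated as a constant with respect to $\theta$, which is the usual CD simplification and, as we understand it, the place where the extra parameter gradient term gets dropped.

```python
import numpy as np

def energy_grad_x(x, theta):
    # Toy energy (assumption): E_theta(x) = 0.5 * (x - theta)^2, so dE/dx = x - theta.
    return x - theta

def energy_grad_theta(x, theta):
    # dE/dtheta = theta - x for the same toy energy.
    return theta - x

def langevin(x, theta, n_steps, step, rng):
    # Unadjusted Langevin dynamics driven by the EBM itself:
    # x <- x - step * dE/dx + sqrt(2 * step) * noise
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - step * energy_grad_x(x, theta) + np.sqrt(2.0 * step) * noise
    return x

def cd_gradient(data, theta, n_steps, step, rng):
    # Short-run chains are initialized at the data (CD-k style).
    samples = langevin(data.copy(), theta, n_steps, step, rng)
    # Positive phase minus negative phase; samples are treated as constants in theta.
    return energy_grad_theta(data, theta).mean() - energy_grad_theta(samples, theta).mean()

def train_cd(data, theta0=0.0, lr=0.5, iters=300, n_steps=5, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = theta0
    for _ in range(iters):
        theta -= lr * cd_gradient(data, theta, n_steps, step, rng)
    return theta

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=2000)
theta_hat = train_cd(data)
print(f"data mean = {data.mean():.3f}, learned theta = {theta_hat:.3f}")
```

Every parameter update runs `n_steps` Langevin steps, each of which needs a gradient of the energy. That per-update sampling cost is what a parameter-free diffusion is meant to avoid. On this unimodal toy problem the short-run approximation is harmless, so the sketch only illustrates the mechanics, not the failure modes reported in the paper.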

Paper Structure (51 sections, 7 theorems, 86 equations, 4 figures, 4 tables, 1 algorithm)

Introduction
Background
Energy-based models.
Contrastive divergence and the non-negligible gradient term.
Diffusion process.
Diffusion contrastive divergences
CD with general diffusions
Connections to existing methods
Connection to Diffusion Recovery Likelihood.
DCD as a KL-contraction divergence.
Evolution of the energy function
DCD-VE.
Train time-dependent EBM with DCD
Experiments
Energy modeling of 2D distributions
...and 36 more sections

Key Result

Theorem 2

Let $\boldsymbol{F}(\boldsymbol{x},t)$ and $\boldsymbol{G}(t)$ be two pre-defined functions. For two distributions $p$ and $q$, assume that both evolve according to the same diffusion process, equation (1) of the paper. Let $p^{(t)}$ and $q^{(t)}$ denote their time-$t$ marginal distributions under this SDE. Then we have
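As a numerical illustration of two distributions evolving under the same diffusion, the sketch below tracks the KL divergence between two 1D Gaussians while both are perturbed by the same drift-free, variance-exploding-style noise $x_t = x_0 + \sigma_t \epsilon$. It relies only on the closed-form Gaussian KL and the general fact that applying the same noising to both distributions cannot increase their KL divergence. The specific means, variances, noise levels, and helper names (`kl_gauss`, `ve_marginal`) are assumptions chosen for illustration, and the sketch is not a restatement of the theorem.

```python
import numpy as np

def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for 1D Gaussians, where v1 and v2 are variances."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def ve_marginal(m, v, noise_var):
    # Drift-free perturbation x_t = x_0 + sigma_t * eps:
    # the mean is unchanged and the variance grows by sigma_t^2 = noise_var.
    return m, v + noise_var

# Hypothetical "data" and "model" distributions as (mean, variance).
p = (0.0, 1.0)
q = (1.5, 0.5)

noise_vars = np.array([0.0, 0.1, 0.5, 1.0, 5.0, 25.0])
kls = np.array([
    kl_gauss(*ve_marginal(*p, s), *ve_marginal(*q, s)) for s in noise_vars
])

# Gap between the KL at time 0 and the KL after diffusing both distributions.
gaps = kls[0] - kls

for s, kl, gap in zip(noise_vars, kls, gaps):
    print(f"noise_var={s:5.1f}  KL={kl:.4f}  gap={gap:.4f}")
```

The KL between the two marginals shrinks as the shared noise level grows, so the gap between the initial KL and the diffused KL is nonnegative and vanishes only at zero noise in this example. That nonnegativity is the property a contraction-based divergence needs in order to be usable as a training objective.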

Figures (4)

Figure 1: Illustration of DCD and CD. The yellow area represents the corresponding divergence. CD uses the EBM-induced Langevin dynamics to transport the data and EBM distributions toward the same EBM distribution. DCD uses a more general diffusion process to transport both the data and EBM distributions toward a common distribution.
Figure 2: Left: 2D examples where CD and PCD fail to learn a correct EBM but DCD-VE succeeds. Right: CelebA $32\times 32$ samples generated from an EBM trained with DCD-VE.
Figure 3: Comparison of different training methods.
Figure 4: CD fails to remove large added noise, while DCD (VE) denoises successfully.

Theorems & Definitions (15)

Definition 1: Diffusion Contrastive Divergence
Theorem 2
Proposition 1
Remark 1
Theorem 3
Proposition 2
Lemma 4
proof
Lemma 5
proof
...and 5 more