Improved Contrastive Divergence Training of Energy Based Models

Yilun Du, Shuang Li, Joshua Tenenbaum, Igor Mordatch

TL;DR

The paper tackles instability in training energy-based models with contrastive divergence. It reintroduces a previously neglected KL-gradient term and shows how to estimate it efficiently, combining Langevin dynamics with a nearest-neighbor entropy surrogate. It also introduces data augmentation transitions and a multi-scale energy formulation to improve mixing, robustness, and generation quality. Empirically, these components yield improved stability and performance on image generation, out-of-distribution detection, and compositional generation.

Abstract

Contrastive divergence is a popular method of training energy-based models, but is known to have difficulties with training stability. We propose an adaptation to improve contrastive divergence training by scrutinizing a gradient term that is difficult to calculate and is often left out for convenience. We show that this gradient term is numerically significant and in practice is important to avoid training instabilities, while being tractable to estimate. We further highlight how data augmentation and multi-scale processing can be used to improve model robustness and generation quality. Finally, we empirically evaluate stability of model architectures and show improved performance on a host of benchmarks and use cases, such as image generation, OOD detection, and compositional generation.
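The ingredients named above (Langevin sampling, the contrastive divergence objective, and a nearest-neighbor entropy surrogate) can be illustrated on a toy problem. The following is a minimal sketch, not the paper's implementation: the 1D quadratic energy, the step size, and every function name are assumptions made for illustration, and the entropy surrogate is simply the mean log nearest-neighbor distance without the constants of a full estimator.

```python
import math
import random

# Hypothetical toy energy E(x) = 0.5 * (x - mu)^2; its Gibbs distribution is N(mu, 1).
def energy(x, mu=2.0):
    return 0.5 * (x - mu) ** 2

def energy_grad(x, mu=2.0):
    return x - mu

def langevin_sample(x0, rng, n_steps=200, step=0.1):
    """Unadjusted Langevin dynamics: x <- x - (step/2) * dE/dx + sqrt(step) * noise."""
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - 0.5 * step * energy_grad(x) + math.sqrt(step) * noise
    return x

def cd_loss(data, samples):
    """Contrastive divergence objective: mean energy of data minus mean energy of samples."""
    pos = sum(energy(x) for x in data) / len(data)
    neg = sum(energy(x) for x in samples) / len(samples)
    return pos - neg

def nn_entropy_surrogate(samples):
    """Mean log distance to the nearest neighbor; larger values mean more spread-out samples."""
    total = 0.0
    for i, xi in enumerate(samples):
        nearest = min(abs(xi - xj) for j, xj in enumerate(samples) if j != i)
        total += math.log(nearest + 1e-12)  # small constant avoids log(0)
    return total / len(samples)

rng = random.Random(0)
samples = [langevin_sample(0.0, rng) for _ in range(200)]
data = [rng.gauss(2.0, 1.0) for _ in range(200)]
print("CD loss:", cd_loss(data, samples))
print("entropy surrogate:", nn_entropy_surrogate(samples))
```

In this sketch the energy has no trainable parameters, so the quantities are only evaluated, not optimized. In an actual training loop both terms would be differentiated with respect to the model parameters.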

Paper Structure

This paper contains 27 sections, 17 equations, 20 figures, 6 tables, 2 algorithms.

Figures (20)

  • Figure 1: (Left) 128x128 samples on unconditional CelebA-HQ. (Right) 128x128 samples on unconditional LSUN Bedroom.
  • Figure 2: Illustration of our overall proposed framework for training EBMs. EBMs are trained with contrastive divergence, where the energy function decreases energy of real data samples (green dot) and increases the energy of hallucinations (red dot). EBMs are further trained with a KL loss which encourages generated hallucinations (shown as a solid red ball) to have low underlying energy and high diversity (shown as blue balls). Red/green arrows indicate forward computation while dashed arrows indicate gradient backpropagation.
  • Figure 3: Illustration of our multi-scale EBM architecture. Our energy function over an image is defined compositionally as the sum of energy functions on different resolutions of an image.
  • Figure 4: Randomly selected unconditional 128x128 CelebA-HQ images generated from our trained EBM model. Samples are relatively diverse with limited artifacts.
  • Figure 5: Visualization of Langevin dynamics sampling chains on an EBM trained on CelebA-HQ 128x128. Samples travel between different modes of images. Each consecutive image represents 30 steps of sampling, with data augmentation transitions every 60 steps.
  • ...and 15 more figures
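
The multi-scale formulation described in the Figure 3 caption, where the energy of an image is the sum of energy functions applied at different resolutions, can be sketched as follows. This is an illustrative sketch only: the 2x2 average pooling, the sum-of-squares stand-in for a per-scale energy network, and all function names are assumptions, not details taken from the paper.

```python
def avg_pool2(img):
    """Downsample a 2D list-of-lists image by 2x2 average pooling (assumes even height and width)."""
    h, w = len(img), len(img[0])
    return [
        [(img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4.0
         for j in range(0, w - 1, 2)]
        for i in range(0, h - 1, 2)
    ]

def toy_energy(img):
    """Hypothetical stand-in for a learned per-scale energy network: sum of squared pixels."""
    return sum(v * v for row in img for v in row)

def multiscale_energy(img, energy_fns):
    """Total energy = sum of per-scale energies; scale k sees the image downsampled k times."""
    total = 0.0
    current = img
    for k, fn in enumerate(energy_fns):
        if k > 0:
            current = avg_pool2(current)
        total += fn(current)
    return total

image = [[1.0] * 4 for _ in range(4)]
# Full resolution (4x4) contributes 16, half (2x2) contributes 4, quarter (1x1) contributes 1.
print(multiscale_energy(image, [toy_energy, toy_energy, toy_energy]))  # 21.0
```

Because the total is a plain sum, its gradient with respect to the image is the sum of the per-scale gradients, so a sampler such as Langevin dynamics can be run on the combined energy directly.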