Cooperative Training of Descriptor and Generator Networks

Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, Ying Nian Wu

TL;DR

CoopNets introduces a cooperative framework to jointly train a bottom-up descriptor energy-based network and a top-down generator latent-variable network using MCMC teaching. The generator provides initial synthesized samples that the descriptor refines via finite-step Langevin dynamics, while the descriptor's revisions guide the generator to reproduce those refinements, effectively unifying energy-based and latent-variable learning. Across textures, objects, scenes, digits, and dynamic textures, CoopNets yields highly realistic synthesis and robust pattern completion, often outperforming GAN- and VAE-based baselines. The approach offers a new perspective on combining undirected and directed models and suggests natural extensions to conditional generation.

Abstract

This paper studies the cooperative training of two generative models for image modeling and synthesis. Both models are parametrized by convolutional neural networks (ConvNets). The first model is a deep energy-based model, whose energy function is defined by a bottom-up ConvNet, which maps the observed image to the energy. We call it the descriptor network. The second model is a generator network, which is a non-linear version of factor analysis. It is defined by a top-down ConvNet, which maps the latent factors to the observed image. The maximum likelihood learning algorithms of both models involve MCMC sampling such as Langevin dynamics. We observe that the two learning algorithms can be seamlessly interwoven into a cooperative learning algorithm that can train both models simultaneously. Specifically, within each iteration of the cooperative learning algorithm, the generator model generates initial synthesized examples to initialize a finite-step MCMC that samples and trains the energy-based descriptor model. After that, the generator model learns from how the MCMC changes its synthesized examples. That is, the descriptor model teaches the generator model by MCMC, so that the generator model accumulates the MCMC transitions and reproduces them by direct ancestral sampling. We call this scheme MCMC teaching. We show that the cooperative algorithm can learn highly realistic generative models.
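The loop described in the abstract can be illustrated with a minimal one-dimensional sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: the descriptor is a toy density log p(y; theta) = theta*y - y^2/(2*S^2) + const, the generator is y = b + A*x with a fixed scale A so that only the offset b is learned, and the names `langevin_revise` and `cooperative_train`, the step size, and the learning rates are all hypothetical. The known latent factors x are reused as the inferred ones when the generator is updated.

```python
import random
from statistics import fmean

S = 0.5  # std of the Gaussian term in the toy descriptor (assumed)
A = 0.5  # fixed generator scale (assumed); only the offset b is learned


def langevin_revise(y, theta, n_steps, delta, rng):
    """Finite-step Langevin dynamics on the toy descriptor density
    log p(y; theta) = theta * y - y**2 / (2 * S**2) + const."""
    for _ in range(n_steps):
        grad_log_p = theta - y / S**2  # d log p / d y
        y = y + 0.5 * delta**2 * grad_log_p + delta * rng.gauss(0.0, 1.0)
    return y


def cooperative_train(y_obs, n_iters=300, batch=256, n_steps=10,
                      delta=0.1, lr_d=1.0, lr_g=0.5, seed=0):
    """Toy cooperative loop; returns (descriptor theta, generator offset b)."""
    rng = random.Random(seed)
    theta, b = 0.0, 0.0
    for _ in range(n_iters):
        obs = rng.choices(y_obs, k=batch)
        # Generator: draw latent factors and produce initial syntheses.
        x = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        y_init = [b + A * xi for xi in x]
        # Descriptor: revise the syntheses with finite-step Langevin.
        y_rev = [langevin_revise(yi, theta, n_steps, delta, rng)
                 for yi in y_init]
        # Descriptor update: observed minus revised sufficient statistic
        # (here the statistic is y itself, since f(y; theta) = theta * y).
        theta += lr_d * (fmean(obs) - fmean(y_rev))
        # Generator update (MCMC teaching): regress the revised examples
        # on the known latents; with A fixed this only moves the offset b.
        residuals = [yr - yi for yr, yi in zip(y_rev, y_init)]
        b += lr_g * fmean(residuals)
    return theta, b


if __name__ == "__main__":
    data_rng = random.Random(1)
    y_obs = [data_rng.gauss(2.0, 0.5) for _ in range(2000)]
    theta, b = cooperative_train(y_obs)
    # Both the descriptor mode theta * S**2 and the generator offset b
    # are expected to drift toward the data mean (2.0 here).
    print(theta * S**2, b)
```

In this toy setting the generator offset follows the Langevin revisions, while the descriptor parameter moves until the revised examples match the observed ones on average. A real CoopNets model replaces both linear maps with ConvNets and the scalar updates with back-propagated gradients.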

Paper Structure

This paper contains 21 sections, 25 equations, 17 figures, 6 tables, 3 algorithms.

Figures (17)

  • Figure 1: The flow chart of Algorithm D for training the descriptor network. The updating in Step D2 is based on the difference between the observed examples and the synthesized examples. The Langevin sampling of the synthesized examples from the current model in Step D1 can be time consuming.
  • Figure 2: The flow chart of Algorithm G for training the generator network. The updating in Step G2 is based on the observed examples and their inferred latent factors. The Langevin sampling of the latent factors from the current posterior distribution in Step G1 can be time consuming.
  • Figure 3: The flow chart of the cooperative algorithm. The part of the flow chart for training the descriptor is similar to Algorithm D in Figure 1, except that the D1 Langevin sampling is initialized from the initial synthesized examples supplied by the generator. The part of the flow chart for training the generator can also be mapped to Algorithm G in Figure 2, except that the revised synthesized examples play the role of the observed examples, and the known generated latent factors can be used as inferred latent factors (or be used to initialize the G1 Langevin sampling of the latent factors).
  • Figure 4: The MCMC teaching of the generator alternates between Markov transition and projection. The family of the generator models ${\cal G}$ is illustrated by the black curve. Each distribution is illustrated by a point.
  • Figure 5: Generating texture patterns. Each row displays one texture experiment, where the first image is the training image, and the rest are 3 of the images generated by the CoopNets algorithm. The observed and synthesized images are of size 224 $\times$ 224 pixels.
  • ...and 12 more figures
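
The Langevin sampling in Steps D1 and G1 mentioned in the captions above can be written in the generic form below. This is the standard Langevin update, not a formula quoted from the paper. The symbols are assumptions introduced here: $Y$ is an image, $X$ the latent factors, $\theta$ and $\alpha$ the descriptor and generator parameters, $\delta$ the step size, and $U_\tau$ standard Gaussian noise.

```latex
\begin{align}
  % Step D1: revise a synthesized image under the descriptor density
  Y_{\tau+1} &= Y_\tau
    + \frac{\delta^2}{2}\,\frac{\partial}{\partial Y}\log p(Y_\tau;\theta)
    + \delta\, U_\tau,
  & U_\tau &\sim \mathcal{N}(0, I), \\
  % Step G1: infer latent factors under the generator posterior
  X_{\tau+1} &= X_\tau
    + \frac{\delta^2}{2}\,\frac{\partial}{\partial X}\log p(X_\tau \mid Y;\alpha)
    + \delta\, U_\tau,
  & U_\tau &\sim \mathcal{N}(0, I).
\end{align}
```

In the cooperative scheme of Figure 3, the first chain would start from the generator's initial synthesized examples rather than from noise, and the second chain can be skipped or started from the known latent factors.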