Joint-stochastic-approximation Autoencoders with Application to Semi-supervised Learning

Wenbo He, Zhijian Ou

TL;DR

This paper introduces Joint-stochastic-approximation autoencoders (JAEs) for semi-supervised learning, addressing two key gaps in deep generative models: effective handling of discrete observations/latents and learning criteria that directly target data likelihood. JAEs couple a generative model pθ(x,h) with an inference model qφ(h|x) and optimize them via stochastic approximation to maximize the data log-likelihood while minimizing the inclusive KL divergence KL(pθ(h|x) || qφ(h|x)), enabling stable training even with discrete variables and various encoder–decoder structures. The semi-supervised extension incorporates labels through pθ(x,y,h) and qφ(y,h|x), using labeled data to guide the discriminator-like term and maintaining efficient posterior sampling via the Metropolis independence sampler (MIS). Empirically, JAEs perform robustly across synthetic tasks (factor analysis, GMMs, sequences) and achieve competitive SSL performance on MNIST and SVHN with discrete latent spaces, which the authors describe as the first successful application of discrete latent variable models to challenging semi-supervised tasks. This work provides a new optimization paradigm for DGMs in SSL and highlights the practical viability of discrete latent representations for high-performance semi-supervised learning.
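
The two criteria named above can be written out as follows. This is a sketch in assumed notation, not necessarily the paper's own: $\tilde p(x)$ (the empirical data distribution) is a symbol introduced here for illustration. The two gradient identities are standard results and show why samples from the posterior $p_\theta(h|x)$ are all that is needed to update both models.

```latex
\begin{aligned}
&\min_{\theta}\; \mathrm{KL}\big[\tilde p(x)\,\|\,p_\theta(x)\big],
\qquad
\min_{\phi}\; \mathbb{E}_{\tilde p(x)}\,
  \mathrm{KL}\big[p_\theta(h|x)\,\|\,q_\phi(h|x)\big], \\[4pt]
&\nabla_\theta \log p_\theta(x)
  = \mathbb{E}_{p_\theta(h|x)}\big[\nabla_\theta \log p_\theta(x,h)\big], \\[4pt]
&\nabla_\phi\, \mathrm{KL}\big[p_\theta(h|x)\,\|\,q_\phi(h|x)\big]
  = -\,\mathbb{E}_{p_\theta(h|x)}\big[\nabla_\phi \log q_\phi(h|x)\big].
\end{aligned}
```

Minimizing the first KL term over $\theta$ is equivalent to maximizing the average data log-likelihood, since the entropy of $\tilde p(x)$ does not depend on $\theta$.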

Abstract

Our examination of existing deep generative models (DGMs), including VAEs and GANs, reveals two problems. First, their capability in handling discrete observations and latent codes is unsatisfactory, though there are interesting efforts. Second, both VAEs and GANs optimize criteria that are only indirectly related to the data likelihood. To address these problems, we formally present Joint-stochastic-approximation (JSA) autoencoders, a new family of algorithms for building deep directed generative models, with application to semi-supervised learning. The JSA learning algorithm directly maximizes the data log-likelihood and simultaneously minimizes the inclusive KL divergence between the posterior and the inference model. We provide theoretical results and conduct a series of experiments to show its advantages, such as robustness to structure mismatch between encoder and decoder and consistent handling of both discrete and continuous variables. In particular, we empirically show that JSA autoencoders with discrete latent space achieve performance comparable to other state-of-the-art DGMs with continuous latent space in semi-supervised tasks over the widely adopted datasets MNIST and SVHN. To the best of our knowledge, this is the first demonstration that discrete latent variable models can be successfully applied to challenging semi-supervised tasks.
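
The learning loop described in the abstract can be illustrated on a toy model with a discrete latent variable. The following is a minimal sketch, not the authors' implementation. Everything specific in it is an assumption made for illustration: the tabular softmax parameterization, the names `a`, `B`, `C`, `train_jsa` and `marginal`, the step-size schedule, and the use of one cached latent sample per data point that is refreshed by a Metropolis independence sampler with the inference model as proposal.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def marginal(a, B):
    """p_theta(x) = sum_h p_theta(h) * p_theta(x|h) for the tabular toy model."""
    return softmax(a) @ softmax(B)

def train_jsa(x_data, K, V, n_iters=10000, seed=0):
    """Toy JSA-style training: h in {0..K-1} is latent, x in {0..V-1} is observed."""
    rng = np.random.default_rng(seed)
    a = 0.01 * rng.standard_normal(K)        # logits of p_theta(h)
    B = 0.01 * rng.standard_normal((K, V))   # logits of p_theta(x|h), one row per h
    C = 0.01 * rng.standard_normal((V, K))   # logits of q_phi(h|x), one row per x
    h_cache = rng.integers(K, size=len(x_data))  # persistent chain state per datum

    def log_w(x, h):
        # Log importance weight: log p_theta(x, h) - log q_phi(h|x).
        return (np.log(softmax(a)[h]) + np.log(softmax(B[h])[x])
                - np.log(softmax(C[x])[h]))

    for t in range(n_iters):
        gamma = 0.1 / (1.0 + t / 1000.0) ** 0.6   # decaying step size (assumed schedule)
        i = rng.integers(len(x_data))
        x, h_old = x_data[i], h_cache[i]

        # Metropolis independence step: propose from q_phi, target p_theta(h|x).
        h_new = rng.choice(K, p=softmax(C[x]))
        if np.log(rng.random()) < log_w(x, h_new) - log_w(x, h_old):
            h_cache[i] = h_new
        h = h_cache[i]

        # Stochastic-approximation updates using the (approximate) posterior sample.
        # For softmax logits, grad of log-prob of index k is onehot(k) - probs.
        pa, pb, qc = softmax(a), softmax(B[h]), softmax(C[x])
        a += gamma * (np.eye(K)[h] - pa)       # ascend log p_theta(h)
        B[h] += gamma * (np.eye(V)[x] - pb)    # ascend log p_theta(x|h)
        C[x] += gamma * (np.eye(K)[h] - qc)    # ascend log q_phi(h|x)
    return a, B, C

# Usage on synthetic categorical data (hypothetical example data).
rng = np.random.default_rng(1)
x_data = rng.choice(4, size=500, p=[0.5, 0.3, 0.15, 0.05])
empirical = np.bincount(x_data, minlength=4) / len(x_data)
a, B, C = train_jsa(x_data, K=2, V=4)
tv = 0.5 * np.abs(marginal(a, B) - empirical).sum()
print(f"total variation between p_theta(x) and the empirical data: {tv:.3f}")
```

Note that no reparameterization or continuous relaxation is needed: the discrete latent is sampled directly, and both updates are plain log-probability gradients evaluated at that sample.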

Paper Structure

This paper contains 19 sections, 2 theorems, 8 equations, 10 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

If Eq. (eq:JSA_unsup_gradient) is solvable, then the stochastic approximation (SA) algorithm can be applied to find its root.
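
To make "applying SA to find a root" concrete, here is a minimal sketch of the classical Robbins-Monro iteration on a one-dimensional problem. It is a generic illustration under assumptions, not the paper's algorithm: the function name `robbins_monro`, the step size $1/t$, and the target $f(\lambda) = \mathbb{E}[z] - \lambda$ are chosen here only for the example. The iteration sees noisy evaluations of $f$ and still converges to its root.

```python
import numpy as np

def robbins_monro(noisy_f, lam0, samples):
    """Iterate lam <- lam + gamma_t * F(lam, z_t) with step size gamma_t = 1/t."""
    lam = float(lam0)
    for t, z in enumerate(samples, start=1):
        gamma = 1.0 / t
        lam += gamma * noisy_f(lam, z)
    return lam

# Find the root of f(lam) = E[z] - lam from noisy evaluations F(lam, z) = z - lam.
rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=1.0, size=5000)
lam_hat = robbins_monro(lambda lam, z_t: z_t - lam, lam0=0.0, samples=z)
print(f"estimated root: {lam_hat:.4f}")
```

With this particular step size the iterate equals the running sample mean, which makes the convergence easy to check by hand.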

Figures (10)

  • Figure 1: Results for factor analysis. Upper: KL divergences between $p_{\theta}(h|x)$ and $q_{\phi}(h|x)$ during training. Lower: KL divergences between the oracle $p_0(x)$ and the estimated $p_{\theta}(x)$ during training.
  • Figure 2: Comparison of a VAE with 2d Gaussian prior for latent code $h$ (row 1), a JAE with 2d Gaussian prior (row 2) and a JAE with a mixture of 4d Bernoulli and 1d Gaussian prior (row 3).
  • Figure 3: Column 1: part of the training data from the context-free grammar. Columns 2, 3 and 4: data generated by JAE, GAN and VAE, respectively. Both GAN and VAE use the Gumbel-softmax trick. The GAN result is copied from gumbelgan. Temperatures $\{0.1, 0.01, 0.001\}$ are tested for Gumbel-softmax with VAE.
  • Figure 4: Conditional generation by the semi-supervised JAE over the MNIST dataset, using 60d Bernoulli prior. The leftmost column shows images from the test set. The other columns are generated by varying class label y for each column, and keeping latent h inferred from the leftmost column.
  • Figure 5: Class-conditional traversal in the discrete latent space. The center images in the two pictures are the reconstructions. Surrounding images are generated with several units of the latent codes flipped randomly. The number of flipped units follows the board distance to the center.
  • ...and 5 more figures

Theorems & Definitions (4)

  • Proposition 1
  • proof
  • Proposition 2
  • proof