Stochastic Variational Inference with Tuneable Stochastic Annealing

John Paisley, Ghazal Fazelnia, Brian Barr

TL;DR

We propose a modified SVI approach -- applicable to both large and small datasets -- that allows the amount of annealing done by SVI to be tuned and yields an approximation to the maximum entropy stochastic gradient at a desired variance level.

Abstract

We exploit the observation that stochastic variational inference (SVI) is a form of annealing and present a modified SVI approach -- applicable to both large and small datasets -- that allows the amount of annealing done by SVI to be tuned. We are motivated by the fact that, in SVI, the larger the batch size, the closer the gradient noise is to Gaussian but the smaller its variance, which reduces the amount of annealing available to escape bad local optima. We propose a simple method that achieves both goals: larger-variance noise to escape bad local optima, and more data information to obtain more accurate gradient directions. The idea is to set an actual batch size, which may be the size of the dataset, and an effective batch size that matches the increased variance of a smaller batch size. The result is an approximation to the maximum entropy stochastic gradient at a desired variance level. We theoretically motivate our "SVI+" approach for the conjugate exponential family model framework and illustrate its empirical performance for learning the probabilistic matrix factorization collaborative filter (PMF), the Latent Dirichlet Allocation topic model (LDA), and the Gaussian mixture model (GMM).
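The actual-versus-effective batch size idea can be sketched in a few lines. This is a minimal illustration, not the authors' algorithm: it assumes scalar per-data-point terms `lambdas`, estimates their variance from the batch itself, and adds zero-mean Gaussian noise so that the total variance roughly matches that of a batch of size `effective_batch`. The function name, its arguments, and the specific noise variance are assumptions made for this example.

```python
import math
import random
import statistics


def svi_plus_gradient(lambdas, actual_batch, effective_batch, rng):
    """Noisy batch average whose variance imitates a smaller batch.

    lambdas: per-data-point scalar contributions (hypothetical stand-in
        for the lambda_n terms).
    actual_batch: |S|, the number of indices actually drawn.
    effective_batch: M <= |S|, the batch size whose noise level we imitate.
    rng: a random.Random instance.
    """
    if not 1 <= effective_batch <= actual_batch:
        raise ValueError("need 1 <= effective_batch <= actual_batch")
    # Draw the index set S uniformly iid (with replacement).
    batch = [lambdas[rng.randrange(len(lambdas))] for _ in range(actual_batch)]
    mean = statistics.fmean(batch)
    # Per-point variance, estimated from the batch (an assumption of this sketch).
    var = statistics.pvariance(batch)
    # Assumed variance matching: target var/M minus the var/|S| already present.
    extra_var = var * (1.0 / effective_batch - 1.0 / actual_batch)
    xi = rng.gauss(0.0, math.sqrt(extra_var))
    return xi + mean
```

Setting `effective_batch` equal to `actual_batch` adds no noise and returns the plain batch average; smaller values of `effective_batch` inject more noise while the mean is still computed from all `actual_batch` points.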

Paper Structure

This paper contains 13 sections, 1 theorem, 27 equations, 4 figures, 1 algorithm.

Key Result

Theorem 1

Let $M$ be the effective batch size of SVI+ and write the actual batch size as $|\mathcal{S}| = \tau M$, $\tau \geq 1$, where $\mathcal{S}$ is an index set selected uniformly iid from $\{1,\dots,N\}$. Let the SVI+ gradient be $Y = \xi + |\mathcal{S}|^{-1}\sum_{n\in\mathcal{S}}\lambda_n$, where $\xi \sim$ [...] where $d$ is the dimensionality of $Y$ and $Z$, and $C$ is a constant.
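The theorem statement is cut off in this copy, so the distribution of $\xi$ cannot be read from it. As a hedged illustration only, the variance-matching calculation below shows one choice consistent with the abstract's description of an effective batch size. It assumes $\xi$ is zero-mean and independent of $\mathcal{S}$, and it introduces $\Sigma$ (not a symbol from the source) for the covariance of $\lambda_n$ under a uniform draw of $n$.

```latex
\begin{align*}
\operatorname{Cov}\Big(\frac{1}{|\mathcal{S}|}\sum_{n\in\mathcal{S}}\lambda_n\Big)
  &= \frac{\Sigma}{|\mathcal{S}|} = \frac{\Sigma}{\tau M}
  && \text{(iid draws, actual batch)} \\
\operatorname{Cov}(Y) &= \frac{\Sigma}{M}
  && \text{(target: effective batch of size } M\text{)} \\
\operatorname{Cov}(\xi) &= \frac{\Sigma}{M} - \frac{\Sigma}{\tau M}
  = \Big(1 - \frac{1}{\tau}\Big)\frac{\Sigma}{M}
  && \text{(noise needed to close the gap)}
\end{align*}
```

For the LDA setting reported in Figure 2, $|\mathcal{S}| = 1000$ with $M = 50$ gives $\tau = 20$, so under this assumption the added noise would carry $1 - 1/20 = 0.95$ of the target covariance; with $M = 100$, $\tau = 10$ and the fraction is $0.9$.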

Figures (4)

  • Figure 1: Variational objective as a function of iteration for the probabilistic matrix factorization collaborative filter model. We use both the 1M and 10M MovieLens datasets and average over 20 random initializations. In all experiments, the batch size $|\mathcal{S}_t|$ equals the data size; the only difference for SVI+ is the effective batch size used for stochastically annealing the gradients. SVI+ finds a better local optimal solution and converges significantly faster.
  • Figure 2: SVI+ compared with SVI for the LDA model, averaged over 10 runs, as a function of 1500 stochastic gradient steps. SVI+ with $|\mathcal{S}|=1000$ and $M=50,100$ finds better local optimal solutions than SVI using any of these batch sizes. This indicates that the annealing done by SVI+ at an effective level of $M=50,100$ performs better than SVI with an actual batch size of 50 or 100. We also observe that increasing the batch size of SVI degrades performance, possibly due to reduced annealing from less stochasticity in the gradients.
  • Figure 3: VI objective as a function of iteration for a typical run. Left: Synthetic data. Right: Pima dataset with four different effective batch sizes $M$.
  • Figure 4: TOP: Average size-sorted empirical distribution of data across 50 clusters. The ground truth number of clusters was 4. Batch inference consistently overestimates this number, while SVI oversimplifies the model by underestimating it. BOTTOM: Box plot version of the top panel (SVI vs SVI+) showing the cumulative percentage of data contained up to the given cluster (150 runs). SVI (blue) puts more data in fewer clusters than SVI+ (red).

Theorems & Definitions (3)

  • Definition 1: SVI+
  • Theorem 1: Max Entropy Gradients
  • Proof: Proof (sketch)