Tuning-Free Maximum Likelihood Training of Latent Variable Models via Coin Betting

Louis Sharrock, Daniel Dodd, Christopher Nemeth

TL;DR

Two new particle-based algorithms for learning latent variable models via marginal maximum likelihood estimation, including one which is entirely tuning-free and another which is entirely learning rate free, based on coin betting techniques from convex optimization.

Abstract

We introduce two new particle-based algorithms for learning latent variable models via marginal maximum likelihood estimation, including one which is entirely tuning-free. Our methods are based on the perspective of marginal maximum likelihood estimation as an optimization problem: namely, as the minimization of a free energy functional. One way to solve this problem is via the discretization of a gradient flow associated with the free energy. We study one such approach, which resembles an extension of Stein variational gradient descent, establishing a descent lemma which guarantees that the free energy decreases at each iteration. This method, and any other obtained as the discretization of the gradient flow, necessarily depends on a learning rate which must be carefully tuned by the practitioner in order to ensure convergence at a suitable rate. With this in mind, we also propose another algorithm for optimizing the free energy which is entirely learning rate free, based on coin betting techniques from convex optimization. We validate the performance of our algorithms across several numerical experiments, including several high-dimensional settings. Our results are competitive with existing particle-based methods, without the need for any hyperparameter tuning.
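As a rough illustration of the coin betting idea mentioned in the abstract, the sketch below minimizes a one-dimensional convex function without any learning rate, using the standard Krichevsky-Trofimov betting scheme from the convex optimization literature. It is not the particle-based algorithm proposed in the paper; the function name `coin_betting_minimize`, its parameters, and the test objective $|x-3|$ are hypothetical choices made here for illustration, and the sketch assumes subgradients bounded in $[-1,1]$.

```python
def coin_betting_minimize(grad, x0=0.0, steps=2000, initial_wealth=1.0):
    """Learning-rate-free minimization of a 1-D convex function.

    Assumes |grad(x)| <= 1 for all x (an assumption of this sketch).
    Returns the average of the points at which grad was queried.
    """
    sum_outcomes = 0.0  # running sum of coin outcomes c_s = -grad(x_s)
    reward = 0.0        # running sum of c_s * (x_s - x0), i.e. money won so far
    x = x0              # the first bet is zero, so the first query point is x0
    average = 0.0
    for t in range(1, steps + 1):
        # Include the current query point in the running average.
        average += (x - average) / t
        # The coin outcome is the negative (sub)gradient at the current point.
        c = -grad(x)
        sum_outcomes += c
        reward += c * (x - x0)
        # Krichevsky-Trofimov bet: stake a signed fraction
        # sum_outcomes / (t + 1) of the current wealth.
        x = x0 + sum_outcomes / (t + 1) * (initial_wealth + reward)
    return average


def abs_grad(x, target=3.0):
    """Subgradient of |x - target|; always lies in [-1, 1]."""
    if x > target:
        return 1.0
    if x < target:
        return -1.0
    return 0.0


estimate = coin_betting_minimize(abs_grad)
print(estimate)
```

The only inputs are a starting point and an initial wealth; there is no step size to tune, which is the property the abstract refers to as learning rate free. The individual iterates can oscillate, so the sketch returns their running average.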

Paper Structure

This paper contains 38 sections, 7 theorems, 84 equations, 20 figures, 3 algorithms.

Key Result

Theorem 1

Assume that Assumptions assumption:bounded_k - assumption:bounded_I_stein hold. Suppose ${0<\gamma\leq\gamma_{*}}$, where Then, for all $t\geq 0$, there exist positive constants $A_1,A_2>0$, given in Appendix app:proof_descent_lemma, such that

Figures (20)

  • Figure 1: Results for the toy hierarchical model. MSE of the parameter estimate $\theta_{t}$ as a function of the learning rate after $T=500$ iterations (a); and MSE of the parameter estimate (b) and the posterior mean (c) as a function of the number of iterations, using the optimal learning rate from (a).
  • Figure 2: Additional results for the toy hierarchical model. Estimates for the posterior variance in the case $d_z=1$ obtained using (a) Coin EM and (b) SVGD EM, as a function of the number of iterations. In (c), we plot the MSE of the posterior variance estimate as a function of the learning rate, for Coin EM, SVGD EM, PGD, and SOUL, after $T=250$ iterations and with $N=50$ particles.
  • Figure 3: Results for the Bayesian logistic regression. Plots of (a) the sequence of parameter estimates $\theta_{t}$ initialized at zero, (b) the kernel density estimate of four components of the posterior approximation $\hat{\mu}_{800}^n = \frac{1}{n}\sum_{j=1}^n\delta_{z_{800}^{j}}$, and (c) the test error as a function of the learning rate.
  • Figure 4: Results for the Bayesian neural network model. Test error over $T=500$ training iterations, for different $N$. For all learning-rate-dependent methods, we use the best learning rate as determined by Fig. \ref{fig:bnn_mnist_compare_lr}.
  • Figure 5: Results for the latent space network model. Mean of the particles $\{z_{T}^i\}_{i=1}^N$ output by Coin EM after $T=500$ iterations. Each node of the network represents a Game of Thrones character.
  • ...and 15 more figures

Theorems & Definitions (20)

  • Remark 1
  • Theorem 1
  • Theorem 2
  • Proposition 1
  • proof
  • Remark 2
  • Proposition 2
  • proof
  • Remark 3
  • Proposition 3
  • ...and 10 more