Dynamic Learning Rate Decay for Stochastic Variational Inference

Maximilian Dinkel, Gil Robalo Rei, Wolfgang A. Wall

TL;DR

Stochastic Variational Inference (SVI) is sensitive to the learning rate: convergence is slow when the base rate is too small, and the variational parameters oscillate when it is too large. The authors propose Dynamic Learning Rate Decay (DLRD), a memory-efficient scheme that computes a signal-to-noise ratio (SNR) from the history of the variational parameters and reduces the base learning rate whenever oscillations outweigh progress. The method is shown to complement existing optimizers (e.g., Adam, AdaMax, RMSprop) and to improve convergence on a synthetic toy problem, Bayesian logistic regression on a breast-cancer dataset, and Bayesian calibration of a diffusivity field, while reducing sensitivity to the initial learning rate and batch size. A noted limitation is that DLRD detects oscillations but not divergence; the authors suggest future work on detecting divergence and on extending the method to variational families richer than Gaussian.

Abstract

Like many optimization algorithms, Stochastic Variational Inference (SVI) is sensitive to the choice of the learning rate. If the learning rate is too small, the optimization process may be slow, and the algorithm might get stuck in local optima. On the other hand, if the learning rate is too large, the algorithm may oscillate or diverge, failing to converge to a solution. Adaptive learning rate methods such as Adam, AdaMax, Adagrad, or RMSprop automatically adjust the learning rate based on the history of gradients. Nevertheless, if the base learning rate is too large, the variational parameters might still oscillate around the optimal solution. With learning rate schedules, the learning rate can be reduced gradually to mitigate this problem. However, the amount by which the learning rate should be decreased in each iteration is not known a priori, and this choice can significantly impact the performance of the optimization. In this work, we propose a method to decay the learning rate based on the history of the variational parameters. We use an empirical measure that quantifies the oscillation of the variational parameters relative to their progress and adapt the learning rate accordingly. The approach requires little memory and is computationally efficient. We demonstrate in various numerical examples that our method reduces the sensitivity of the optimization performance to the learning rate and that it can also be used in combination with other adaptive learning rate methods.
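The idea of decaying the learning rate when oscillation outweighs progress can be illustrated with a small Python sketch. This is not the authors' algorithm: the class name `SnrDecay`, the parameters `window`, `threshold`, and `factor`, and the specific SNR definition (absolute mean of the recent parameter increments divided by their standard deviation, averaged over parameters) are all assumptions chosen for illustration, since the paper's SNR equation is not reproduced here.

```python
import math
from collections import deque


class SnrDecay:
    """Hypothetical SNR-triggered decay; an illustration, not the paper's DLRD."""

    def __init__(self, eta0, window=20, threshold=1.0, factor=0.5):
        self.eta = eta0              # current base learning rate
        self.window = window         # number of increments in the SNR estimate (assumed)
        self.threshold = threshold   # decay when the mean SNR drops below this (assumed)
        self.factor = factor         # multiplicative decay factor (assumed)
        # Only window + 1 parameter vectors are stored, so memory stays small.
        self.history = deque(maxlen=window + 1)

    def mean_snr(self):
        """Mean over parameters of |mean increment| / std of increments."""
        params = list(self.history)
        n_steps = len(params) - 1
        dim = len(params[0])
        total = 0.0
        for d in range(dim):
            steps = [params[i + 1][d] - params[i][d] for i in range(n_steps)]
            mean = sum(steps) / n_steps
            var = sum((s - mean) ** 2 for s in steps) / n_steps
            # Small constant avoids division by zero for perfectly steady progress.
            total += abs(mean) / (math.sqrt(var) + 1e-12)
        return total / dim

    def update(self, params):
        """Record the latest variational parameters and return the learning rate."""
        self.history.append(list(params))
        if len(self.history) <= self.window:
            return self.eta          # not enough history yet
        if self.mean_snr() < self.threshold:
            self.eta *= self.factor  # oscillation dominates progress: decay
            self.history.clear()     # restart the history after a decay
        return self.eta
```

In this sketch, `update` would be called once per SVI iteration with the current variational parameters, and the returned value would be passed to the optimizer (for example as Adam's base learning rate). Parameters that move steadily in one direction yield a large SNR and leave the rate unchanged, whereas parameters that bounce around a fixed point yield an SNR near zero and trigger a decay.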

Paper Structure

This paper contains 12 sections, 26 equations, 8 figures, 1 algorithm.

Figures (8)

  • Figure 1: Optimization progress of SVI for a toy problem with two variational parameters $\boldsymbol{\lambda}$ using Adam with different learning rates $\eta$.
  • Figure 2: Exemplary course of variational parameters $\boldsymbol{\lambda}_i$ over iterations $i$ using Adam with static base learning rate and corresponding mean SNR $\overline{\rho}_i$ (see \ref{eq:snr}).
  • Figure 3: Accuracy of the variational distribution over the number of iterations for the synthetic test case (see \ref{eqn3:synthetic_joint}) using Adam and SGD optimizer with and without DLRD for varying initial base learning rates $\eta_0$. Accuracy is measured by Jeffreys divergence between the optimal variational distribution and the variational distribution at the current iteration (see \ref{eqn:DJ}).
  • Figure 4: Performance comparison between DLRD and learning rate schedule of the form \ref{eqn:lr_schedule} with $\zeta=0.5$ (purple) and $\zeta=1$ (yellow) for the synthetic test case (see \ref{eqn3:synthetic_joint}) using SGD with an initial learning rate of $\eta_0=1.0\times 10^{-2}$ and $\eta_0=1.0\times 10^{-3}$. Accuracy is measured by Jeffreys divergence between the optimal variational distribution and the variational distribution at the current iteration (see \ref{eqn:DJ}).
  • Figure 5: Performance comparison between DLRD, SASA and SASA+ for the synthetic test case (see \ref{eqn3:synthetic_joint}) using SGD for varying initial learning rates $\eta_0$. Accuracy is measured by Jeffreys divergence between the optimal variational distribution and the variational distribution at the current iteration (see \ref{eqn:DJ}).
  • ...and 3 more figures
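
Several captions above report accuracy as the Jeffreys divergence between two variational distributions. The paper's own definition is not reproduced on this page, so the following Python sketch rests on two assumptions: the Jeffreys divergence is taken as the sum of the two Kullback-Leibler divergences (some authors include an additional factor of 1/2), and both distributions are Gaussians with diagonal covariance. The function names are hypothetical.

```python
import math


def kl_diag_gauss(mean_p, std_p, mean_q, std_q):
    """KL(p || q) for two Gaussians with diagonal covariance (closed form)."""
    total = 0.0
    for mp, sp, mq, sq in zip(mean_p, std_p, mean_q, std_q):
        total += math.log(sq / sp) + (sp**2 + (mp - mq) ** 2) / (2.0 * sq**2) - 0.5
    return total


def jeffreys_divergence(mean_p, std_p, mean_q, std_q):
    """Symmetrized KL: KL(p || q) + KL(q || p) (assumed convention, no 1/2 factor)."""
    return (kl_diag_gauss(mean_p, std_p, mean_q, std_q)
            + kl_diag_gauss(mean_q, std_q, mean_p, std_p))


# Two unit-variance Gaussians whose means differ by 1.
print(jeffreys_divergence([0.0], [1.0], [1.0], [1.0]))
```

For the two unit-variance Gaussians in the usage line, each KL term equals 0.5, so the printed value is 1.0. The divergence is zero when the two distributions coincide, which is why it can serve as a distance-like accuracy measure to the optimal variational distribution.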