Fast Compute for ML Optimization

Nick Polson, Vadim Sokolov

TL;DR

The paper introduces Scale Mixture EM (SM-EM), a tuning-free optimizer for losses that admit a variance-mean scale-mixture representation. Using latent-variable data augmentation (e.g., Pólya–Gamma), SM-EM turns each update into a weighted least-squares M-step with a model-derived precision $\tau^{-2}\hat{\Lambda}+X^\top\hat{\Omega}X$, where $\hat{\Omega}$ and $\hat{\Lambda}$ are adaptive weights recomputed at every iteration. The approach unifies proximal and adaptive-gradient perspectives: it connects to Adam/AdamW but derives its curvature- and shrinkage-based weights from the loss geometry. It also supports Nesterov acceleration and pathwise amortization over regularization grids. Empirically, SM-EM with Nesterov acceleration attains up to $13\times$ lower final loss than grid-tuned Adam on synthetic ill-conditioned logistic benchmarks, and it can accelerate regularization paths via shared sufficient statistics and Halton Monte Carlo for large-scale M-steps. The framework offers a principled, model-based alternative to heuristic adaptive methods, with potential extensions to stochastic online variants and broader loss classes.

Abstract

We study optimization for losses that admit a variance-mean scale-mixture representation. Under this representation, each EM iteration is a weighted least squares update in which latent variables determine observation and parameter weights; these play roles analogous to Adam's second-moment scaling and AdamW's weight decay, but are derived from the model. The resulting Scale Mixture EM (SM-EM) algorithm removes user-specified learning-rate and momentum schedules. On synthetic ill-conditioned logistic regression benchmarks with $p \in \{20, \ldots, 500\}$, SM-EM with Nesterov acceleration attains up to $13\times$ lower final loss than Adam tuned by learning-rate grid search. For a 40-point regularization path, sharing sufficient statistics across penalty values yields a $10\times$ runtime reduction relative to the same tuned-Adam protocol. For the base (non-accelerated) algorithm, EM monotonicity guarantees nonincreasing objective values; adding Nesterov extrapolation trades this guarantee for faster empirical convergence.
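
To make the weighted least-squares update concrete, here is a minimal sketch of the textbook Pólya–Gamma EM iteration for ridge-penalized logistic regression. It is an illustration under stated assumptions, not the authors' implementation: a Gaussian prior is assumed, so that $\hat{\Lambda} = I$ and only the observation weights $\hat{\Omega}$ change between iterations. The function names (`pg_weights`, `penalized_nll`, `sm_em_logistic`) and the synthetic data are hypothetical.

```python
import numpy as np

def pg_weights(psi):
    """E-step: E[omega | psi] = tanh(psi / 2) / (2 psi) for omega ~ PG(1, psi)."""
    psi = np.asarray(psi, dtype=float)
    small = np.abs(psi) < 1e-6
    safe = np.where(small, 1.0, psi)  # placeholder value avoids 0/0 where psi ~ 0
    return np.where(small, 0.25, np.tanh(safe / 2.0) / (2.0 * safe))

def penalized_nll(beta, X, y, tau):
    """Logistic negative log-likelihood plus the ridge penalty ||beta||^2 / (2 tau^2)."""
    psi = X @ beta
    return np.sum(np.logaddexp(0.0, psi) - y * psi) + beta @ beta / (2.0 * tau**2)

def sm_em_logistic(X, y, tau=1.0, n_iter=50):
    """EM with a weighted least-squares M-step; no learning rate is involved.

    Assumption: Gaussian prior, so Lambda-hat is the identity matrix.
    """
    n, p = X.shape
    beta = np.zeros(p)
    kappa = y - 0.5  # working response for the logistic likelihood
    history = [penalized_nll(beta, X, y, tau)]
    for _ in range(n_iter):
        omega = pg_weights(X @ beta)                                  # E-step
        precision = np.eye(p) / tau**2 + X.T @ (omega[:, None] * X)   # tau^-2 I + X' Omega X
        beta = np.linalg.solve(precision, X.T @ kappa)                # M-step
        history.append(penalized_nll(beta, X, y, tau))
    return beta, np.array(history)

# Small synthetic problem (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 1.5])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)
beta_hat, history = sm_em_logistic(X, y, tau=1.0, n_iter=50)
```

In this sketch `history` records the penalized objective after each iteration; for the non-accelerated update it should be nonincreasing, which is the EM monotonicity property the abstract refers to. The dense `np.linalg.solve` call costs $O(p^3)$ per iteration.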

Paper Structure

This paper contains 52 sections, 6 theorems, 66 equations, 8 figures, 5 tables, and 1 algorithm.

Key Result

Proposition 1

The conditional first moments satisfy

Figures (8)

  • Figure 1: Convergence on moderately conditioned logistic regression ($\mathrm{cond} \approx 50$). SM-EM (no learning rate) achieves the lowest loss. PRM (implicit SGD) converges but requires step size tuning.
  • Figure 2: Adam's final loss as a function of learning rate $\alpha$. SM-EM (dashed line) matches the best Adam result in this sweep without learning-rate tuning.
  • Figure 3: Effect of conditioning. Left: $\mathrm{cond} \approx 50$. Right: $\mathrm{cond} \approx 500$. SM-EM converges robustly in both cases; Adam and PRM degrade as conditioning worsens.
  • Figure 4: Nesterov acceleration. Left: $\mathrm{cond} \approx 50$ (modest effect). Right: $\mathrm{cond} \approx 500$ (gains of 55% for SM-EM and 54% for PRM). The paper's Nesterov table reports a separate ill-conditioned experiment with gains of 32% (SM-EM) and 53% (PRM).
  • Figure 5: Left: Final NLL vs. $p$ ($\mathrm{cond}{=}500$, $n{=}5000$). SM-EM+Nesterov (no learning-rate tuning) reaches lower NLL than grid-tuned Adam at every dimension tested; the gap widens with $p$. Adam at its default learning rate ($\alpha{=}10^{-3}$) has not converged in 80 epochs. Right: wall-clock time. SM-EM's $O(p^3)$ solve is the dominant cost for $p \geq 200$.
  • ...and 3 more figures

Theorems & Definitions (11)

  • Proposition 1: Gradient-to-weight identity; Proposition 1 of polson2013data
  • Proposition 2: Weighted least squares M-step; Proposition 2 of polson2013data
  • Remark 1: Adam as a heuristic approximation
  • Proposition 3: Robbins–Monro form; scalar case
  • proof
  • Remark 2
  • Lemma 4: Fenchel–Moreau duality; boyd2004convex, §3.3.2
  • Theorem 5: Nesterov, 1983; Beck and Teboulle, 2009
  • proof
  • Theorem 6: Location mixtures; Geman–Yang
  • ...and 1 more