High dimensional theory of two-phase optimizers

Atish Agarwala

Abstract

The trend towards larger training setups has brought renewed interest in partially asynchronous two-phase optimizers, which optimize locally and then synchronize across workers. Additionally, recent work suggests that the one-worker version of one of these algorithms, DiLoCo, shows promising results as a (synchronous) optimizer. Motivated by these studies, we present an analysis of LA-DiLoCo, a simple member of the DiLoCo family, on a high-dimensional linear regression problem. We show that the one-worker variant, LA, provides a different tradeoff between signal and noise than SGD, which is beneficial in many scenarios. We also show that the multi-worker version generates more noise than the single-worker version, but that this additional noise can be mitigated by an appropriate choice of hyperparameters. We conclude with an analysis of SLA (LA with momentum) and show that stacking two momentum operators gives an opportunity for acceleration via a non-linear transformation of the "effective" Hessian spectrum, which is maximized for Nesterov momentum. Altogether, our results show that two-phase optimizers represent a fruitful new paradigm for understanding and improving training algorithms.
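
The two-phase structure the abstract describes is straightforward to sketch. Below is a minimal Python/NumPy sketch of the local-optimize-then-synchronize loop on a linear regression problem, assuming a plain-SGD inner loop and a simple averaging-based outer step with outer learning rate $\nu$; the hyperparameter names ($\eta$, $\nu$, $R$, $B$, $S$) follow the paper's notation, but the update rules and all numerical values are illustrative assumptions rather than the paper's exact LA-DiLoCo algorithm. Setting $R = 1$ recovers the single-worker (LA-style) variant discussed above.

```python
import numpy as np

def la_diloco(X, y, R=4, S=10, eta=0.05, nu=1.0, B=16, cycles=50, seed=0):
    """Sketch of a two-phase optimizer: R workers each take S local SGD
    steps from the shared parameters, then the averaged local change is
    applied as an outer update with step size nu."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)                      # shared ("outer") parameters
    losses = []
    for _ in range(cycles):
        deltas = []
        for _ in range(R):                   # phase 1: local optimization
            w = theta.copy()
            for _ in range(S):               # S inner SGD steps per worker
                idx = rng.integers(0, n, size=B)
                grad = X[idx].T @ (X[idx] @ w - y[idx]) / B
                w -= eta * grad
            deltas.append(w - theta)
        # phase 2: synchronize -- treat the averaged local change as an
        # outer "pseudo-gradient" and apply it with outer learning rate nu
        theta = theta + nu * np.mean(deltas, axis=0)
        losses.append(0.5 * np.mean((X @ theta - y) ** 2))
    return theta, losses

# toy high-dimensional regression data (illustrative values only)
rng = np.random.default_rng(1)
D = 100
X = rng.standard_normal((2000, D))
theta_star = rng.standard_normal(D) / np.sqrt(D)
y = X @ theta_star + 0.1 * rng.standard_normal(2000)

theta, losses = la_diloco(X, y)
print(f"final loss: {losses[-1]:.4f}")
```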

Paper Structure

This paper contains 21 sections, 2 theorems, 95 equations, and 3 figures.

Key Result

Theorem 3.1

Consider the dynamics of Equation eq:diloco_pvec for fixed $\nu$ and $\eta$. Suppose $\boldsymbol{\Lambda}$ has identical eigenvalues. Then, for fixed total batch size $B_{\rm tot}\equiv BR$, the eigenvalues of the linear system strictly increase with $R$.
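
To make the fixed-compute constraint concrete (using the $B_{\rm tot} = 64$ value from Figure 2 purely as an illustrative instance), fixing $B_{\rm tot}\equiv BR$ means the per-worker batch shrinks as workers are added:

$$B = \frac{B_{\rm tot}}{R}, \qquad (R, B) \in \{(1, 64),\ (2, 32),\ (4, 16),\ (8, 8)\}.$$

The theorem states that the eigenvalues of the linear system strictly increase along this sequence, consistent with the abstract's observation that the multi-worker version generates more noise than the single-worker version at fixed total batch size.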

Figures (3)

  • Figure 1: Loss after one cycle of $S = 10$ steps as a function of $\eta$ and $\nu$ for spiked spectrum (left). Model has batch fraction $B/D = 0.2$, and spectrum has two eigenvalues $\lambda_{0}$ ($99\%$ of eigenmodes) and $20\lambda_{0}$ ($1\%$ of eigenmodes). Optimal $\eta^*(\nu)$ decreases in $\nu$. Optimizing $\eta$ for each $\nu$ reveals that the optimal $\nu$ is greater than $1$ (middle). Power law spectrum with exponent $\alpha = -1.5$ has optimal $\nu<1$ (right, $B/D = 0.005$).
  • Figure 2: Averaged loss curves for LA-DiLoCo (solid lines) are well captured by the theoretical model (dashed lines) across many settings with fixed total batch size (all panels: $D = 3200$, $B_{\rm tot} = 64$, various learning rates). Keeping $\eta$ and $\nu$ independent of the number of workers $R$ leads to similar dynamics at smaller learning rates (top left) but causes learning curves for larger $R$ to diverge early at larger learning rates (top right). The $R^{-1/2}$ scaling rule for $\eta$, with $\nu\eta$ fixed, gives better correspondence across $R$ for all learning rates (bottom), but the dynamics still become non-universal at larger $\eta$ (bottom right).
  • Figure 3: Eigenmodes damped by the inner optimizer show decreasing per-step convergence rate with momentum-GD in the outer optimizer (top, $\nu = 2$). The convergence rate is improved for Nesterov momentum. At the critical $\nu = (1-\beta_{\rm out})^{-1}$, the Nesterov convergence rate becomes $S$-invariant at large $S$ (bottom). The reduced system with $\tilde{\mathbf{k}}_{t}$ reset to $0$ every cycle behaves similarly to the full dynamics (with $\tilde{\mathbf{k}}_{t}$ preserved) for larger $S$, but shows faster convergence/better stability at smaller $S$. All experiments have $\eta = 1$, $\lambda = 0.2$, $\beta_{\rm in} = 0.9$, $\beta_{\rm out} = 0.8$.

Theorems & Definitions (3)

  • Theorem 3.1
  • Theorem 1.1
  • Proof