Stochastic Approximation with Block Coordinate Optimal Stepsizes

Tao Jiang, Lin Xiao

TL;DR

This work introduces BCOS, a family of stochastic approximation methods with block-coordinate stepsizes chosen to minimize the expected distance from the next iterate to an unknown target point. The paper expresses RMSProp and Adam(W) as BCOS variants and develops two practical second-moment estimators, a single-EMA estimator and a conditional estimator, so that several optimizers fit one framework while needing less memory and fewer hyper-parameters. The authors prove convergence under an aiming condition that requires neither convexity nor smoothness; the iterates converge to a neighborhood of the target whose radius depends on the bias and variance of the second-moment estimator. In experiments on large-scale models, BCOSW variants are competitive with AdamW, train more smoothly, and keep fewer optimizer states.
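The idea of a stepsize that minimizes the expected distance to the target can be illustrated with a short worked calculation. This is only a sketch under stated assumptions: distance is taken as squared Euclidean distance, a single scalar stepsize $\gamma$ is used for one block, and the symbols $\gamma$, $d$, $x^\star$ are notation chosen here, not necessarily the paper's. For an update $x^+ = x - \gamma d$ with random search direction $d$ and target $x^\star$, conditioning on the current $x$ gives

$$
\mathbb{E}\,\|x - \gamma d - x^\star\|^2
= \|x - x^\star\|^2 - 2\gamma\,\mathbb{E}\langle d,\, x - x^\star\rangle + \gamma^2\,\mathbb{E}\|d\|^2 ,
$$

which is a quadratic in $\gamma$ minimized at

$$
\gamma^\star = \frac{\mathbb{E}\langle d,\, x - x^\star\rangle}{\mathbb{E}\|d\|^2} .
$$

The numerator involves the unknown target and cannot be computed directly, while the denominator is the second moment of the search direction along the block, which is the quantity that can be estimated online.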

Abstract

We consider stochastic approximation with block-coordinate stepsizes and propose adaptive stepsize rules that aim to minimize the expected distance from the next iterate to an (unknown) target point. These stepsize rules employ online estimates of the second moment of the search direction along each block coordinate. The popular Adam algorithm can be interpreted as a variant with a specific estimator. By leveraging a simple conditional estimator, we derive a new method that obtains competitive performance against Adam but requires less memory and fewer hyper-parameters. We prove that this family of methods converges almost surely to a small neighborhood of the target point, and the radius of the neighborhood depends on the bias and variance of the second-moment estimator. Our analysis relies on a simple aiming condition that assumes neither convexity nor smoothness, thus has broad applicability.
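To make the role of the online second-moment estimate concrete, here is a minimal sketch in pure Python. It is not the paper's algorithm: it assumes blocks of size one (single coordinates), an exponential moving average (EMA) of the squared search direction as the estimator, and a fixed scale `alpha`. The names `second_moment_step` and `run_demo`, the test problem, and all constants are hypothetical choices made for illustration.

```python
import math
import random


def second_moment_step(x, d, v, alpha=0.01, beta=0.9, eps=1e-8):
    """One coordinate-wise step x <- x - alpha * d / (sqrt(v) + eps).

    v is an EMA of d**2, i.e. an online estimate of the second moment
    of the search direction along each coordinate (assumed estimator,
    no bias correction).
    """
    new_x, new_v = [], []
    for xi, di, vi in zip(x, d, v):
        vi = beta * vi + (1.0 - beta) * di * di  # update EMA of d^2
        new_v.append(vi)
        new_x.append(xi - alpha * di / (math.sqrt(vi) + eps))
    return new_x, new_v


def run_demo(steps=2000, seed=0):
    """Noisy gradients of 0.5 * sum(c_i * (x_i - target_i)**2)."""
    rng = random.Random(seed)
    target = [1.0, -2.0, 0.5]
    curvature = [10.0, 1.0, 0.1]  # differently scaled coordinates
    x = [0.0, 0.0, 0.0]
    v = [0.0, 0.0, 0.0]
    for _ in range(steps):
        # Search direction: exact gradient plus Gaussian noise.
        d = [c * (xi - ti) + rng.gauss(0.0, 0.01)
             for c, xi, ti in zip(curvature, x, target)]
        x, v = second_moment_step(x, d, v)
    return x, target


if __name__ == "__main__":
    x, target = run_demo()
    print("final iterate:", x)
    print("target:       ", target)
```

Because each coordinate is divided by its own root-second-moment estimate, the three coordinates move at a similar rate despite very different curvatures, and the iterate settles into a small neighborhood of the target rather than converging exactly, in line with the neighborhood-type result described above.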

Paper Structure

This paper contains 32 sections, 15 theorems, 132 equations, 8 figures, and 4 algorithms.

Key Result

Lemma 1

Suppose the aiming condition (Assumption assum:aiming) holds, $\alpha_t \geq 0$, and $\alpha_t\lambda < 1$ for all $t \geq 0$. Then the sequence $\{x_t\}$ generated by the conceptual BCOSW update (eqn:bcosw-conceptual) satisfies, for all $t \geq 0$, ... where ...

Figures (8)

  • Figure 1: Comparing AdamW and BCOSW-c with different momentum parameters.
  • Figure 2: Comparing AdamW and three variants of BCOSW. Left: the first 10k iterations; Right: all 100k iterations.
  • Figure 3: Left: Adam(W) with $\beta_1=0.9$ and $\beta_2=0.99$. Right: BCOS(W)-c with $\beta=0.9$.
  • Figure 4: Train and test loss by varying max value of stepsize schedule.
  • Figure 5: Left: ResNet-20 on CIFAR10. Right: Vision Transformer on ImageNet.
  • ...and 3 more figures

Theorems & Definitions (15)

  • Lemma 1
  • Theorem 2: Almost sure convergence
  • Lemma 3: ROBBINS1971
  • Corollary 4
  • Theorem 5
  • Lemma 6: chung1954
  • Theorem 7
  • Lemma 8: chung1954
  • Lemma 9
  • Lemma 10
  • ...and 5 more