Stochastic Approximation with Block Coordinate Optimal Stepsizes
Tao Jiang, Lin Xiao
TL;DR
This work introduces BCOS, a family of stochastic approximation methods whose block-coordinate stepsizes aim to minimize the expected distance from the next iterate to an unknown target point. By expressing RMSProp and Adam(W) as BCOS variants and developing practical single-EMA and conditional estimators, the paper places several optimizers in a single framework while reducing memory use and the number of hyper-parameters to tune. The authors establish convergence under a broad aiming condition that requires neither convexity nor smoothness: the iterates converge to a neighborhood of the target whose radius depends on the bias and variance of the second-moment estimator. Empirical results on large-scale models show that BCOSW variants achieve competitive performance with smoother training and fewer optimizer states, suggesting practical utility for deep learning tasks.
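As a point of reference for the estimators mentioned above, the following is a minimal sketch of a standard RMSProp-style update, in which each coordinate keeps an exponential moving average (EMA) of its squared search direction. It is not the paper's BCOS implementation: the function name `rmsprop_step`, the toy quadratic objective, the noise level, and the hyper-parameter values are all assumptions chosen for illustration.

```python
import math
import random


def rmsprop_step(x, d, v, lr=1e-2, beta=0.99, eps=1e-8):
    """One RMSProp-style step (illustrative, not the paper's BCOS rule).

    x: current iterate, d: stochastic search direction,
    v: per-coordinate EMA estimate of the second moment of d.
    """
    new_x, new_v = [], []
    for xi, di, vi in zip(x, d, v):
        vi = beta * vi + (1.0 - beta) * di * di              # EMA of squared direction
        new_v.append(vi)
        new_x.append(xi - lr * di / (math.sqrt(vi) + eps))   # coordinate-wise scaled step
    return new_x, new_v


# Toy problem (assumed): noisy gradients of 0.5 * ||x - target||^2.
random.seed(0)
target = [0.0, 0.0]
x = [5.0, -3.0]
v = [0.0, 0.0]
for _ in range(2000):
    d = [xi - ti + 0.1 * random.gauss(0.0, 1.0) for xi, ti in zip(x, target)]
    x, v = rmsprop_step(x, d, v)

# Distance to the target after the run; with a constant stepsize and noisy
# directions the iterate hovers near the target rather than reaching it exactly.
final_dist = math.sqrt(sum((xi - ti) ** 2 for xi, ti in zip(x, target)))
```

On this toy problem the iterate settles into a small region around the target instead of converging exactly, which gives an informal picture of the neighborhood-type convergence the summary describes.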
Abstract
We consider stochastic approximation with block-coordinate stepsizes and propose adaptive stepsize rules that aim to minimize the expected distance from the next iterate to an (unknown) target point. These stepsize rules employ online estimates of the second moment of the search direction along each block coordinate. The popular Adam algorithm can be interpreted as a variant with a specific estimator. By leveraging a simple conditional estimator, we derive a new method that achieves performance competitive with Adam but requires less memory and fewer hyper-parameters. We prove that this family of methods converges almost surely to a small neighborhood of the target point, where the radius of the neighborhood depends on the bias and variance of the second-moment estimator. Our analysis relies on a simple aiming condition that assumes neither convexity nor smoothness, and thus has broad applicability.
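To see why second-moment estimates arise naturally from the distance-minimizing objective, consider the following sketch for a single block. The notation is an assumption made for illustration and may differ from the paper's: $x_{t,k}$ is block $k$ of the iterate, $d_{t,k}$ the stochastic search direction, $x^\star_k$ the target, $\alpha$ a scalar stepsize, and $\mathbb{E}_t$ the expectation conditional on $x_t$.

```latex
\begin{aligned}
x_{t+1,k} &= x_{t,k} - \alpha\, d_{t,k}, \\
\mathbb{E}_t\bigl\|x_{t+1,k} - x^\star_k\bigr\|^2
  &= \bigl\|x_{t,k} - x^\star_k\bigr\|^2
     - 2\alpha\, \mathbb{E}_t\bigl\langle d_{t,k},\, x_{t,k} - x^\star_k \bigr\rangle
     + \alpha^2\, \mathbb{E}_t\bigl\|d_{t,k}\bigr\|^2, \\
\alpha^{\mathrm{opt}}
  &= \frac{\mathbb{E}_t\bigl\langle d_{t,k},\, x_{t,k} - x^\star_k \bigr\rangle}
          {\mathbb{E}_t\bigl\|d_{t,k}\bigr\|^2}.
\end{aligned}
```

The expected squared distance is a quadratic in $\alpha$, so setting its derivative to zero gives the last line. The denominator is the second moment of the search direction along block $k$, which can be estimated online. The numerator involves the unknown target $x^\star_k$ and cannot be computed directly, so any practical rule has to approximate it in some way.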
