Analysis of an Idealized Stochastic Polyak Method and its Application to Black-Box Model Distillation
Robert M. Gower, Guillaume Garrigos, Nicolas Loizou, Dimitris Oikonomou, Konstantin Mishchenko, Fabian Schaipp
TL;DR
This work analyzes SPS$^*$, an idealized stochastic Polyak step size that uses the loss at a solution, $f_{\xi}(x_*)$, to set the step size adaptively for convex and locally regular losses. The authors prove an anytime convergence bound for SPS$^*$ under a local expected gradient bound, and develop a momentum variant, IAM, whose last iterate matches the rates that SPS$^*$ attains for averaged iterates in both the non-smooth and smooth settings. The paper also demonstrates an application to black-box model distillation, where a large teacher model guides the training of a smaller student without hyperparameter tuning, with the teacher's loss serving as a surrogate for $f_{\xi}(x_*)$. Because SPS$^*$ requires $f_{\xi}(x_*)$, it remains idealized; the authors discuss realistic approximations and practical safeguards. The results are relevant to tuning-free training and to knowledge distillation workflows.
Abstract
We provide a general convergence theorem for an idealized stochastic Polyak step size called SPS$^*$. Besides convexity, we only assume a local expected gradient bound, which includes locally smooth and locally Lipschitz losses as special cases. We refer to SPS$^*$ as idealized because it requires access to the loss of every training batch evaluated at a solution. It is also ideal, in that it achieves the optimal lower bound for globally Lipschitz functions, and is the first Polyak step size to have an $O(1/\sqrt{t})$ anytime convergence rate in the smooth setting. We show how to combine SPS$^*$ with momentum to achieve the same favorable rates for the last iterate. We conclude with several experiments that validate our theory, and with a more practical setting showing how we can distill a teacher GPT-2 model into a smaller student model without any hyperparameter tuning.
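To make the idealized step size concrete, the sketch below runs a Polyak-type update on a toy problem. It assumes the usual Polyak form $\gamma_t = (f_{\xi}(x_t) - f_{\xi}(x_*))_+ / \|\nabla f_{\xi}(x_t)\|^2$, including the clipping at zero, which the abstract does not spell out. The toy losses $f_i(x) = \tfrac{1}{2}\|x - c_i\|^2$, the function names, and all constants are illustrative assumptions, chosen only because $x_*$ (the mean of the $c_i$) and hence $f_i(x_*)$ are available in closed form.

```python
import random


def sps_star_step(x, grad, loss_x, loss_star):
    """One Polyak-type update x - gamma * grad, with
    gamma = max(loss_x - loss_star, 0) / ||grad||^2 (assumed form)."""
    sq_norm = sum(g * g for g in grad)
    if sq_norm == 0.0:
        return list(x)  # zero gradient: leave the iterate unchanged
    gamma = max(loss_x - loss_star, 0.0) / sq_norm
    return [xi - gamma * gi for xi, gi in zip(x, grad)]


def loss(x, c):
    # Toy per-sample loss f_i(x) = 0.5 * ||x - c_i||^2.
    return 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def grad(x, c):
    # Gradient of the toy loss: x - c_i.
    return [xi - ci for xi, ci in zip(x, c)]


def dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def run_demo(steps=2000, seed=0):
    rng = random.Random(seed)
    centers = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)]
    # The minimizer of the average loss is the mean of the centers.
    x_star = [sum(c[k] for c in centers) / len(centers) for k in range(2)]
    # The "idealized" ingredient: each sample's loss at the solution.
    star_losses = [loss(x_star, c) for c in centers]
    x = [5.0, -5.0]
    dists = [dist(x, x_star)]
    for _ in range(steps):
        i = rng.randrange(len(centers))
        x = sps_star_step(x, grad(x, centers[i]), loss(x, centers[i]), star_losses[i])
        dists.append(dist(x, x_star))
    return dists


if __name__ == "__main__":
    d = run_demo()
    print("initial distance to x_*:", d[0])
    print("final distance to x_*:  ", d[-1])
```

With this step size and convex losses, the distance to $x_*$ never increases from one step to the next, and no learning rate has to be chosen. In the distillation setting described above, the per-batch values held in `star_losses` would presumably be replaced by the teacher's losses.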
