Analysis of an Idealized Stochastic Polyak Method and its Application to Black-Box Model Distillation

Robert M. Gower; Guillaume Garrigos; Nicolas Loizou; Dimitris Oikonomou; Konstantin Mishchenko; Fabian Schaipp

TL;DR

This work analyzes an idealized stochastic Polyak step size, SPS$^*$, which uses the loss at the solution, $f_{\xi}(x_*)$, to obtain favorable convergence for convex and locally regular losses. The authors prove an anytime convergence bound for SPS$^*$ under a local expected gradient bound and develop a momentum-enhanced variant, IAM. IAM matches the rates that SPS$^*$ achieves for averaged iterates, but for the last iterate, in both the non-smooth and smooth settings. The paper also demonstrates a practical application in black-box model distillation, where a large teacher model guides the training of a smaller student without hyperparameter tuning, with the teacher's loss serving as a surrogate for $f_{\xi}(x_*)$. SPS$^*$ remains idealized because it needs $f_{\xi}(x_*)$, so the authors discuss realistic approximations and practical safeguards. The results are relevant to hyperparameter-free training and to knowledge distillation workflows.

Abstract

We provide a general convergence theorem for an idealized stochastic Polyak step size called SPS$^*$. Besides convexity, we only assume a local expected gradient bound, which includes locally smooth and locally Lipschitz losses as special cases. We refer to SPS$^*$ as idealized because it requires access to the loss for every training batch evaluated at a solution. It is also ideal, in that it achieves the optimal lower bound for globally Lipschitz functions, and is the first Polyak step size to have an $O(1/\sqrt{t})$ anytime convergence rate in the smooth setting. We show how to combine SPS$^*$ with momentum to achieve the same favorable rates for the last iterate. We conclude with several experiments that validate our theory, and with a more practical setting showing how we can distill a teacher GPT-2 model into a smaller student model without any hyperparameter tuning.
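To make the idealized step concrete, here is a minimal sketch on a toy least-squares problem. It assumes the usual Polyak-type form of the step, $x_{t+1} = x_t - \frac{(f_{\xi}(x_t) - f_{\xi}(x_*))_+}{\|g_t\|^2}\, g_t$ with $g_t$ a stochastic gradient of $f_{\xi}$ at $x_t$, which is not spelled out in the abstract. The function names `sps_star_step` and `run_demo`, the problem sizes, and the noise level are all hypothetical choices for illustration; the solution $x_*$ is computed in closed form only so that the per-sample losses $f_{\xi}(x_*)$ are available, which is exactly the access that makes the method idealized.

```python
import numpy as np


def sps_star_step(x, grad, loss, loss_at_solution, eps=1e-12):
    """One assumed SPS*-style update using the batch loss at the solution."""
    # Positive part of the gap f_xi(x_t) - f_xi(x_*): no step if it is negative.
    gap = max(loss - loss_at_solution, 0.0)
    grad_norm_sq = float(np.dot(grad, grad))
    if grad_norm_sq <= eps:
        # Zero gradient: the Polyak ratio is undefined, so stay put.
        return x
    return x - (gap / grad_norm_sq) * grad


def run_demo(n=200, d=10, steps=2000, seed=0):
    """Run the sketch on noisy least squares; return initial and final distance to x_*."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n, d))
    b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
    # Closed-form minimizer, used here only to supply f_i(x_*) for each sample.
    x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

    x = np.zeros(d)
    start_dist = float(np.linalg.norm(x - x_star))
    for _ in range(steps):
        i = rng.integers(n)
        residual = A[i] @ x - b[i]
        loss = 0.5 * residual**2                      # f_i(x_t)
        grad = residual * A[i]                        # gradient of f_i at x_t
        loss_star = 0.5 * (A[i] @ x_star - b[i])**2   # f_i(x_*)
        x = sps_star_step(x, grad, loss, loss_star)
    end_dist = float(np.linalg.norm(x - x_star))
    return start_dist, end_dist


if __name__ == "__main__":
    start, end = run_demo()
    print(f"distance to solution: start={start:.4f}, end={end:.4f}")
```

No learning rate appears anywhere in the loop: the step length is set entirely by the loss gap and the gradient norm. For convex per-sample losses, each update of this form cannot increase the distance to $x_*$, which is why the demo reports that distance rather than a loss value.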
