Analysis of an Idealized Stochastic Polyak Method and its Application to Black-Box Model Distillation

Robert M. Gower, Guillaume Garrigos, Nicolas Loizou, Dimitris Oikonomou, Konstantin Mishchenko, Fabian Schaipp

TL;DR

This work analyzes an idealized stochastic Polyak step size, SPS$^*$, that uses the loss at the solution, $f_{\xi}(x_*)$, to obtain favorable convergence for convex and locally regular losses. The authors prove an anytime convergence bound for SPS$^*$ under a local expected gradient bound and develop a momentum variant, IAM. With it they obtain last-iterate convergence results in both the non-smooth and smooth settings: the last iterate of IAM matches the rate that SPS$^*$ achieves for averaged iterates. The paper also demonstrates a practical application to black-box model distillation, where a large teacher model guides the training of a smaller student without hyperparameter tuning, with the teacher's loss serving as a surrogate for $f_{\xi}(x_*)$. SPS$^*$ remains idealized because it needs $f_{\xi}(x_*)$, but the authors discuss realistic approximations and practical safeguards. The results are relevant to tuning-free training and to knowledge distillation workflows.

Abstract

We provide a general convergence theorem for an idealized stochastic Polyak step size called SPS$^*$. Besides convexity, we only assume a local expected gradient bound, which includes locally smooth and locally Lipschitz losses as special cases. We refer to SPS$^*$ as idealized because it requires access to the loss for every training batch evaluated at a solution. It is also ideal, in that it achieves the optimal lower bound for globally Lipschitz functions, and is the first Polyak step size to have an $O(1/\sqrt{t})$ anytime convergence in the smooth setting. We show how to combine SPS$^*$ with momentum to achieve the same favorable rates for the last iterate. We conclude with several experiments to validate our theory, and a more practical setting showing how we can distill a teacher GPT-2 model into a smaller student model without any hyperparameter tuning.
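
To make the idea of a step size that uses the batch loss at a solution concrete, here is a minimal sketch. It assumes the SPS$^*$ update takes the usual Polyak form $x_{t+1} = x_t - \frac{(f_{\xi}(x_t) - f_{\xi}(x_*))_+}{\|g_t\|^2}\, g_t$, where $g_t$ is the stochastic gradient of $f_{\xi}$ at $x_t$; this page does not show the exact iteration, so treat the formula as an assumption. The function names (`sps_star_step`, `run_sps_star`, `make_problem`) and the noisy least-squares test problem are hypothetical and chosen only for illustration.

```python
import numpy as np


def sps_star_step(x, loss, grad, loss_at_solution):
    """One assumed SPS*-style update for a single sampled loss.

    loss             : f_xi(x), the sampled loss at the current point
    grad             : a (sub)gradient of f_xi at x
    loss_at_solution : f_xi(x_*), the same sampled loss at a solution
    """
    gap = max(loss - loss_at_solution, 0.0)  # positive part of the loss gap
    sq_norm = float(grad @ grad)
    if sq_norm == 0.0:  # zero gradient: nothing to do
        return x
    return x - (gap / sq_norm) * grad


def make_problem(n=200, d=10, noise=0.1, seed=0):
    """Hypothetical noisy least-squares problem; x_* is its exact minimizer."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    b = A @ rng.standard_normal(d) + noise * rng.standard_normal(n)
    x_star = np.linalg.lstsq(A, b, rcond=None)[0]
    return A, b, x_star


def run_sps_star(A, b, x_star, x0, num_steps, seed=0):
    """Run the sketch with f_i(x) = 0.5 * (a_i @ x - b_i)**2, one row per step."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    x_sum = np.zeros_like(x)
    dists = [np.linalg.norm(x - x_star)]
    for _ in range(num_steps):
        x_sum += x  # accumulates x_0, ..., x_{T-1} for the averaged iterate
        i = rng.integers(len(b))
        r = A[i] @ x - b[i]            # residual at the current point
        r_star = A[i] @ x_star - b[i]  # residual at the solution (nonzero with noise)
        x = sps_star_step(x, 0.5 * r * r, r * A[i], 0.5 * r_star * r_star)
        dists.append(np.linalg.norm(x - x_star))
    return x, x_sum / num_steps, np.array(dists)


if __name__ == "__main__":
    A, b, x_star = make_problem()
    x_last, x_avg, dists = run_sps_star(A, b, x_star, np.zeros(A.shape[1]), 2000)
    print("initial distance to x_*:", dists[0])
    print("final distance to x_*:  ", dists[-1])
```

In this sketch the per-sample losses are convex, so each step cannot increase the distance to $x_*$: the step is skipped whenever the sampled loss is already below its value at the solution. Note that the problem is deliberately noisy, so $f_{\xi}(x_*)$ is not zero and has to be supplied, which is exactly what makes the method idealized.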

Paper Structure

This paper contains 54 sections, 39 theorems, 190 equations, 8 figures, 3 tables, 1 algorithm.

Key Result

theorem 0

Consider problem eq:prob and let $(x_t)_{t \geq 0}$ be the iterates of SPS$^*$ given by eqn:sps-iter. Then the iterates are almost surely monotone. If there exist $A, B \geq 0$ with $A + B \neq 0$ such that the local expected gradient bound holds for all $x \in \mathbb{B}_D(x_*)$, then the averaged iterates of SPS$^*$, $\bar{x}_T := \tfrac{1}{T} \sum_{t=0}^{T-1} x_t$, verify:

Figures (8)

  • Figure 1: Distilling a teacher GPT-2 on three datasets. Adaptive learning rate of IAM and learning rates of SGD (top) and cross-entropy training loss (bottom). Black line marks the average teacher loss.
  • Figure 2: Diabetes Data, 15 epochs
  • Figure 4: Interpolation true: IAM with the correct $f_{\xi_t}(x_*)$ converges as fast as SGD-M with the theoretical step size $\frac{1}{4L_{\max}}$. When $\nu$ is small (left), the initial progress of IAM with the average $f(x_*)$ is equally good, before it stalls. For $\nu$ large, the convergence stalls earlier (middle). Increasing the batch size (right) slightly increases the gap between IAM with $f_{\xi_t}(x_*)=0$ and $f_{\xi_t}(x_*)=f(x_*)$.
  • Figure 5: Interpolation false: see caption of Figure 4.
  • Figure 6: Full display of Figure 1. Adaptive learning rate of IAM-Adam compared to Adam (top), of IAM compared to SGD (middle), and the cross-entropy training loss (bottom). Black line marks the average teacher loss.
  • ...and 3 more figures

Theorems & Definitions (78)

  • theorem 0: Convergence of SPS$^*$
  • corollary 0: Non-smooth setting
  • corollary 0: Smooth setting
  • remark 1: Finite sum
  • lemma 2
  • theorem 2: Non-smooth setting
  • theorem 2: Smooth setting
  • definition 3
  • definition 4
  • proposition 5
  • ...and 68 more