Can LLMs predict the convergence of Stochastic Gradient Descent?

Oussama Zekri; Abdelhakim Benechehab; Ievgen Redko

TL;DR

The paper asks whether LLMs can predict SGD convergence. It treats SGD as a Markov chain and uses in-context learning to infer the transition kernel from observed iterates, which enables zero-shot forecasting from unseen initializations. The paper introduces a pipeline that estimates the diagonal blocks $P^{(i,i)}$ of the discretized kernel from LLM logits and completes the global kernel $Q$ with a debiased Sinkhorn barycenter to forecast convergence in both convex and non-convex settings. It also revisits neural scaling laws of ICL from a Markov-chain perspective, highlighting the role of the spectral gap $\rho$ in the learnability of transition dynamics. This framework points toward zero-shot randomized trials for larger DL models, while acknowledging practical scalability and kernel-estimation challenges when extending to real-world, trillion-parameter systems.
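To make the kernel-estimation idea concrete, here is a minimal sketch under strong simplifying assumptions. It is not the paper's pipeline: empirical transition counts over a discretized 1-D state space stand in for LLM logits, and a uniform fill for unobserved rows stands in for the debiased Sinkhorn barycenter. The function names (`estimate_kernel`, `forecast`) and the toy state sequence are hypothetical.

```python
import numpy as np

def estimate_kernel(states, n_bins):
    """Row-stochastic transition matrix built from consecutive state pairs."""
    counts = np.zeros((n_bins, n_bins))
    for s, t in zip(states[:-1], states[1:]):
        counts[s, t] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows never visited get a uniform distribution (assumption: a crude
    # stand-in for the imputation step described in the text).
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1.0), 1.0 / n_bins)

def forecast(kernel, start_state, n_steps):
    """Propagate a point mass at start_state through the chain for n_steps."""
    dist = np.zeros(kernel.shape[0])
    dist[start_state] = 1.0
    for _ in range(n_steps):
        dist = dist @ kernel
    return dist

# Toy trajectory of discretized iterates that settles in bin 2; bin 3 is never seen.
states = [0, 1, 1, 2, 2, 2]
kernel = estimate_kernel(states, n_bins=4)
dist = forecast(kernel, start_state=3, n_steps=50)  # start from the unseen bin
```

In this toy example the forecast started from the unobserved bin still concentrates on the bin where the observed trajectory settled, which is the behavior the forecasting step relies on.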

Abstract

Large language models are well known for their impressive performance across a wide range of tasks. One surprising example is the recently identified capacity of LLMs to understand the governing principles of dynamical systems that satisfy the Markov property. In this paper, we explore this direction further by studying the dynamics of stochastic gradient descent in convex and non-convex optimization. By leveraging the theoretical link between SGD and Markov chains, we show remarkable zero-shot performance of LLMs in predicting the local minima to which SGD converges for previously unseen starting points. More generally, we ask whether LLMs can be used to perform zero-shot randomized trials for the larger deep learning models used in practice.
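The link between SGD and Markov chains can be illustrated with a short sketch: each update depends only on the current iterate and a freshly sampled data index, so the iterates form a Markov chain. The example below uses the least-squares objective from the Figure 3 caption with $d=2$ and $N=100$. The step size, step count, seed, synthetic data, and the per-sample target `y[i]` are assumptions made for illustration, not values from the paper.

```python
import numpy as np

def sgd_least_squares(X, y, theta0, lr=0.05, n_steps=2000, seed=0):
    """SGD on f(x_i, theta) = 0.5 * (<x_i, theta> - y_i)^2, returning all iterates."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    iterates = [theta.copy()]
    for _ in range(n_steps):
        i = rng.integers(len(X))                # fresh randomness at each step
        grad = (X[i] @ theta - y[i]) * X[i]     # gradient of the sampled term
        theta = theta - lr * grad               # next state depends only on theta and i
        iterates.append(theta.copy())
    return np.array(iterates)

# Synthetic, noiseless data (assumption): N = 100 samples in d = 2.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(100, 2))
theta_star = np.array([1.0, -2.0])
y = X @ theta_star
trajectory = sgd_least_squares(X, y, theta0=[3.0, 3.0])
```

The rows of `trajectory` are the visited states; discretizing them would give the kind of time series that is shown to the LLM.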

Paper Structure (15 sections, 7 equations, 6 figures, 1 algorithm)

Introduction
Background knowledge
LLMs understand the convergence of SGD
Problem setup
Overparametrized vs. underparametrized regime
From understanding to forecasting
Convex case
Non-convex case
ICL neural scaling laws revisited
Conclusion
Appendix
Detailed ICL for dynamics learning
On the importance of the tokenizer
Obtaining ground truth for SGD
Additional Experiments

Figures (6)

Figure 1: Overview of the proposed approach. After running SGD on a given optimization problem, we tokenize the obtained iterates and feed them to an LLM of choice. We then use the logits to fill the transition kernel of the Markov chain underlying SGD with probabilities $P(x_i|x_j)$, while imputing the entries that were not observed. Finally, we use the estimated transition kernel to forecast for previously unseen inputs.
Figure 2: Top Left and Top Right, a run of SGD in the overparameterized and underparameterized regimes, respectively. Bottom Left and Bottom Right, transition probabilities predicted by the LLM in the overparameterized and underparameterized regimes.
Figure 3: We optimize $F$ defined in \ref{MainProb} with $f(x_i,\theta) = \frac{1}{2}(\langle x_i, \theta \rangle_{\mathbb{R}^2}-y)^2$ for $d=2$ and $N=100$ (see more instances in Appendix \ref{appendix:more_convex}). Left. A full SGD run in the convex case. The visited states constitute the time series shown to the LLM to estimate the transition kernel. Right. Starting from different initial points, simulating the convergence of SGD with the estimated transition matrix leads to convergence to the same global minimum.
Figure 4: We optimize $F$ defined in \ref{MainProb} with $f(x_i,\theta) = \frac{1}{2}(\theta_0\sin(\theta_1 x_i)-y)^2$ for $d=2$ and $N=100$. Left. A full SGD run in the non-convex case. The visited states constitute the time series shown to the LLM and used to estimate the transition kernel. Right. Starting from different initial points, we run the Markov chain with the estimated transition kernel and converge to the same local minima as SGD.
Figure 5: Neural scaling laws for different values of $\rho$. $M(\rho)$ denotes a $2$-state Markov chain of spectral gap $\rho$.
...and 1 more figure
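As a hedged illustration of the 2-state chains $M(\rho)$ mentioned in the Figure 5 caption, the sketch below builds a 2-state transition matrix, computes its spectral gap, and samples a trajectory. The excerpt does not state the paper's exact definition of $\rho$ or how $M(\rho)$ is parameterized, so the convention $\rho = 1 - |\lambda_2|$ and the parameters `a`, `b` are assumptions, and all function names are hypothetical.

```python
import numpy as np

def two_state_chain(a, b):
    """Transition matrix with P(0 -> 1) = a and P(1 -> 0) = b."""
    return np.array([[1.0 - a, a], [b, 1.0 - b]])

def spectral_gap(P):
    """Assumed convention: 1 minus the second-largest eigenvalue modulus."""
    moduli = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return 1.0 - moduli[1]

def sample_chain(P, n_steps, start=0, seed=0):
    """Sample a state sequence of length n_steps + 1 from the chain."""
    rng = np.random.default_rng(seed)
    states = [start]
    for _ in range(n_steps):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

P = two_state_chain(0.1, 0.2)   # eigenvalues are 1 and 1 - a - b
gap = spectral_gap(P)
sequence = sample_chain(P, n_steps=200)
```

For a 2-state chain the eigenvalues are $1$ and $1-a-b$, so under the assumed convention the gap is $1 - |1-a-b|$; a smaller gap means slower mixing, so a context of fixed length tends to carry less information about the stationary behavior of the chain.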
