Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules

Binghui Li, Fengling Chen, Zixun Huang, Lean Wang, Lei Wu

TL;DR

This work develops a functional scaling framework for SGD in power-law kernel regression (PLK) that extends traditional final-loss scaling to the entire loss trajectory. By introducing intrinsic-time SDEs and a Functional Scaling Law (FSL), the authors quantify how learning-rate schedules shape loss dynamics through a convolutional forgetting kernel, enabling explicit data- and compute-optimal scaling relations for constant, exponential, and warmup-stable-decay (WSD) schedules. They show that higher-capacity models are more data- and compute-efficient, that learning-rate decay generally improves scaling efficiency, and that WSD-type schedules outperform pure decay in many regimes. Experiments on synthetic PLK setups and LLM pre-training tasks (0.1B–1B params) validate FSL as a practical surrogate for predicting loss trajectories and guiding LRS design in large-scale pre-training.

Abstract

Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire loss dynamics obey similar laws and, crucially, how the learning rate schedule (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model. The key insight is a novel intrinsic-time viewpoint, which captures the training progress more faithfully than iteration count. We then establish a Functional Scaling Law (FSL) that captures the full loss trajectory under arbitrary LRSs, with the schedule's influence entering through a simple convolutional functional. We further instantiate the theory for three representative LRSs -- constant, exponential decay, and warmup-stable-decay (WSD) -- and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena: (i) higher-capacity models are more data- and compute-efficient; (ii) learning-rate decay improves training efficiency; and (iii) WSD-type schedules outperform pure decay. Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.
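To make the three schedule families concrete, here is a minimal Python sketch of constant, exponential-decay, and WSD schedules, together with a cumulative-learning-rate clock. Everything in it is an illustrative assumption rather than the paper's definition: the function names, the warmup and decay fractions, the peak and final learning rates, and in particular the choice to model "intrinsic time" as the running sum of learning rates, which the abstract does not spell out.

```python
def constant_lr(step, total_steps, peak=1.0):
    """Constant schedule: the same learning rate at every step."""
    return peak


def exp_decay_lr(step, total_steps, peak=1.0, final=0.01):
    """Exponential decay: geometric interpolation from peak to final."""
    frac = step / max(total_steps - 1, 1)
    return peak * (final / peak) ** frac


def wsd_lr(step, total_steps, peak=1.0, final=0.01,
           warmup_frac=0.1, decay_frac=0.1):
    """Warmup-stable-decay: linear warmup, flat plateau, exponential decay.

    The phase fractions are hypothetical defaults chosen for illustration.
    """
    warmup_steps = int(warmup_frac * total_steps)
    decay_steps = int(decay_frac * total_steps)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    if step < stable_end:
        return peak
    frac = (step - stable_end + 1) / decay_steps
    return peak * (final / peak) ** frac


def intrinsic_time(schedule, total_steps, **kwargs):
    """Running sum of learning rates (assumed notion of intrinsic time)."""
    elapsed = 0.0
    clock = []
    for step in range(total_steps):
        elapsed += schedule(step, total_steps, **kwargs)
        clock.append(elapsed)
    return clock


if __name__ == "__main__":
    steps = 100
    for name, sched in [("constant", constant_lr),
                        ("exp-decay", exp_decay_lr),
                        ("wsd", wsd_lr)]:
        print(name, intrinsic_time(sched, steps)[-1])
```

Under this assumed clock, two runs with the same iteration count can have advanced by very different amounts: a decaying schedule accumulates less intrinsic time than a constant one at the same peak rate.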

Paper Structure

This paper contains 63 sections, 45 theorems, 300 equations, 10 figures, 1 table.

Key Result

Theorem 4.2

Under Assumption ass:fsl, let $\bm{\nu}_t$ denote the solution to the intrinsic-time SDE (eqn: sde-instrinc) with top-$M$ features. Then, for any $f^*$ with difficulty $s\in (0,1-1/\beta]$ and any $\sigma \geqslant 0$, the FSL (display equation not reproduced here) holds for all $t\geqslant 0$, where $e(t):=(1+t)^{-s}$ and $\mathcal{K}(t):=(1+t)^{-(2-1/\beta)}$. For the random-$M$ case, the same FSL holds with probability at least $1 - \exp
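The two functions named in the theorem can be evaluated directly. The sketch below implements $e(t)$ and $\mathcal{K}(t)$ exactly as defined above and adds one derived quantity: since $\int_0^T (1+t)^{-a}\,dt = \bigl(1-(1+T)^{1-a}\bigr)/(a-1)$ with $a = 2-1/\beta$, the kernel's total mass is bounded by $\beta/(\beta-1)$ whenever $\beta>1$. The function names are hypothetical, and treating $\mathcal{K}$ as the "forgetting kernel" of the TL;DR is an assumption; the FSL itself is not reconstructed here.

```python
def e(t, s):
    """e(t) = (1 + t)^(-s), with difficulty s in (0, 1 - 1/beta]."""
    return (1.0 + t) ** (-s)


def forgetting_kernel(t, beta):
    """K(t) = (1 + t)^(-(2 - 1/beta)); the name is an assumption."""
    return (1.0 + t) ** (-(2.0 - 1.0 / beta))


def kernel_mass(T, beta):
    """Closed-form integral of K over [0, T], valid for beta > 1."""
    a = 2.0 - 1.0 / beta
    return (1.0 - (1.0 + T) ** (1.0 - a)) / (a - 1.0)


def kernel_mass_numeric(T, beta, n=20000):
    """Midpoint-rule check of kernel_mass (illustrative helper)."""
    h = T / n
    return h * sum(forgetting_kernel((i + 0.5) * h, beta) for i in range(n))


if __name__ == "__main__":
    beta = 2.0
    s = 1.0 - 1.0 / beta  # largest difficulty allowed by the theorem
    for t in (0.0, 10.0, 1000.0):
        print(t, e(t, s), forgetting_kernel(t, beta), kernel_mass(t, beta))
```

Because $s \leqslant 1-1/\beta$, the kernel exponent $2-1/\beta$ is at least $1+s$, so $\mathcal{K}$ decays strictly faster than $e$ and is integrable while $e$ need not be.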

Figures (10)

  • Figure 1: Functional Scaling Law (FSL) accurately captures the loss dynamics and scaling behavior of SGD in kernel regression. In both subplots, solid lines denote the results of SGD, and dashed lines indicate the corresponding FSL predictions. (a) Loss dynamics of SGD (averaged over 1000 runs) compared with FSL predictions under three learning-rate schedules: cosine, WSD-like, and a non-standard cyclic schedule. (b) Final-loss scaling predicted by FSL using the analytical formulas from Section \ref{sec: effect-lrs}, compared with the mean of 200 independent SGD runs.
  • Figure 2: Experiment on LLMs. (a) Fitting and predictive accuracy of the FSL on dense LLaMA models. (b) Left: comparison of various LRSs. Right: loss trajectories of the FSL-optimal schedule versus baseline LRSs on a 1B QwenMoE model.
  • Figure 3: Fitting results of FSL on SGD trajectories. The shaded curves are the average over 200 independent SGD runs, while the solid curves show the predictions of FSL.
  • Figure 4: Experiment on the 1B LLaMA (dense) model. (a) FSL fit to the loss curve of the 1B LLaMA (dense) model trained on 20B tokens with the 8-1-1 LRS. (b, c) Comparison on the 1B model of the optimal LRS, cosine LRS, WSD LRS with exponential decay, and 8-1-1 LRS.
  • Figure 5: Experiment on the 100M GPT2 (dense) model. (a) FSL fit to the loss curve of the 100M GPT2 (dense) model trained on 20B tokens with the 8-1-1 LRS. (b, c) Comparison on the 100M model of the optimal LRS, cosine LRS, WSD LRS with exponential decay, and 8-1-1 LRS.
  • ...and 5 more figures

Theorems & Definitions (86)

  • Remark 2.4
  • Remark 3.1
  • Theorem 4.2: Intrinsic-Time FSL, hard-regime
  • Theorem 4.3: Intrinsic-Time FSL, top-$M$ features, general label noise
  • Theorem 4.4: Intrinsic-Time FSL, top-$M$ features, constant label noise
  • Theorem 4.5: Intrinsic-Time FSL, top-$M$ features, zero label noise
  • Theorem 4.6: Intrinsic-Time FSL, random-$M$ features
  • Lemma 4.7
  • Lemma 4.8: Noise structure
  • Lemma 4.9: Scale invariance
  • ...and 76 more