Functional Scaling Laws in Kernel Regression: Loss Dynamics and Learning Rate Schedules

Binghui Li; Fengling Chen; Zixun Huang; Lean Wang; Lei Wu

TL;DR

This work develops a functional scaling framework for SGD on power-law kernel (PLK) regression that extends traditional final-loss scaling laws to the entire loss trajectory. Using intrinsic-time SDEs and a Functional Scaling Law (FSL), the authors quantify how the learning rate schedule (LRS) shapes loss dynamics through a convolutional forgetting kernel, which yields explicit data- and compute-optimal scaling relations for constant, exponential-decay, and warmup-stable-decay (WSD) schedules. They show that higher-capacity models are more data- and compute-efficient, that learning-rate decay generally improves scaling efficiency, and that WSD-type schedules outperform pure decay in many regimes. Experiments on synthetic PLK setups and LLM pre-training (0.1B to 1B parameters) validate FSL as a practical surrogate for predicting loss trajectories and guiding LRS design in large-scale pre-training.

Abstract

Scaling laws have emerged as a unifying lens for understanding and guiding the training of large language models (LLMs). However, existing studies predominantly focus on the final-step loss, leaving open whether the entire loss dynamics obey similar laws and, crucially, how the learning rate schedule (LRS) shapes them. We address these gaps in a controlled theoretical setting by analyzing stochastic gradient descent (SGD) on a power-law kernel regression model. The key insight is a novel intrinsic-time viewpoint, which captures the training progress more faithfully than iteration count. We then establish a Functional Scaling Law (FSL) that captures the full loss trajectory under arbitrary LRSs, with the schedule's influence entering through a simple convolutional functional. We further instantiate the theory for three representative LRSs -- constant, exponential decay, and warmup-stable-decay (WSD) -- and derive explicit scaling relations in both data- and compute-limited regimes. These comparisons explain key empirical phenomena: (i) higher-capacity models are more data- and compute-efficient; (ii) learning-rate decay improves training efficiency; and (iii) WSD-type schedules outperform pure decay. Finally, experiments on LLMs ranging from 0.1B to 1B parameters demonstrate the practical relevance of FSL as a surrogate model for fitting and predicting loss trajectories in large-scale pre-training.
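To make the three schedule families named in the abstract concrete, the following is a minimal Python sketch. It is not the authors' code. The function names, the linear warmup, the linear decay phase in WSD, and the default phase fractions are all illustrative assumptions. The helper `cumulative_lr` accumulates the learning rate over steps, which is one plausible reading of "intrinsic time"; the text above does not define that quantity, so treat it as an assumption as well.

```python
def constant_lr(step, total_steps, peak_lr):
    """Constant schedule: the same learning rate at every step."""
    return peak_lr


def exponential_decay_lr(step, total_steps, peak_lr, final_lr):
    """Geometric interpolation from peak_lr (first step) to final_lr (last step)."""
    frac = step / max(total_steps - 1, 1)
    return peak_lr * (final_lr / peak_lr) ** frac


def wsd_lr(step, total_steps, peak_lr, final_lr,
           warmup_frac=0.05, decay_frac=0.2):
    """Warmup-stable-decay: linear warmup, flat plateau, then linear decay.

    The phase fractions and the linear shapes are hypothetical choices
    made for illustration only.
    """
    warmup_steps = int(warmup_frac * total_steps)
    decay_steps = int(decay_frac * total_steps)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Ramp up so that the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    # Ramp down so that the last step reaches final_lr.
    frac = (step - decay_start + 1) / decay_steps
    return peak_lr + (final_lr - peak_lr) * frac


def cumulative_lr(schedule, total_steps, **kwargs):
    """Running sum of the learning rate over steps.

    Used here as an assumed proxy for "intrinsic time"; this is not
    taken from the source text.
    """
    total = 0.0
    running = []
    for step in range(total_steps):
        total += schedule(step, total_steps, **kwargs)
        running.append(total)
    return running
```

Under this proxy, two runs with the same iteration count can reach different amounts of accumulated learning rate, for example when comparing `cumulative_lr(constant_lr, 100, peak_lr=0.1)[-1]` with `cumulative_lr(wsd_lr, 100, peak_lr=0.1, final_lr=0.001)[-1]`. That is the kind of gap between training progress and iteration count that the abstract's intrinsic-time viewpoint refers to.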
