Learning Rate Schedules in the Presence of Distribution Shift

Matthew Fahrbach; Adel Javanmard; Vahab Mirrokni; Pratik Worah

TL;DR

This work investigates how to set learning rate schedules for SGD when the data distribution shifts over time. It introduces a dynamic-regret framework and develops an SDE-based analysis of online linear regression to derive optimal adaptive schedules, then extends to general convex and non-convex losses with provable upper and lower bounds that scale with the distribution shift and the gradient noise. The results show that optimal learning rates typically increase with shift. Experiments on high-dimensional regression and a flow cytometry streaming task, including neural-network examples, demonstrate the effectiveness of the proposed schedules. The findings offer practical guidance for online learning systems in non-stationary environments, connecting online optimization, stochastic calculus, and control-theoretic planning to design robust, shift-aware training strategies.

Abstract

We design learning rate schedules that minimize regret for SGD-based online learning in the presence of a changing data distribution. We fully characterize the optimal learning rate schedule for online linear regression via a novel analysis with stochastic differential equations. For general convex loss functions, we propose new learning rate schedules that are robust to distribution shift and we give upper and lower bounds for the regret that only differ by constants. For non-convex loss functions, we define a notion of regret based on the gradient norm of the estimated models and propose a learning schedule that minimizes an upper bound on the total expected regret. Intuitively, one expects changing loss landscapes to require more exploration, and we confirm that optimal learning rate schedules typically increase in the presence of distribution shift. Finally, we provide experiments for high-dimensional regression models and neural networks to illustrate these learning rate schedules and their cumulative regret.

Paper Structure (34 sections, 11 theorems, 116 equations, 6 figures, 1 algorithm)

Introduction
Linear regression.
Convex loss functions.
Non-convex loss functions.
Related work
Connections to online optimization.
Overview of techniques
Problem formulation: Dynamic regret
Linear regression
Case study: No distribution shift
General convex loss
Upper bound on the total regret
Lower bound on the total regret
Non-convex loss
Experiments
...and 19 more sections

Key Result

Proposition 3.1

For any fixed $T,u>0$, there exists a constant $C = C(K,\Gamma, d,\sigma, T,u)$, with parameters $K,\Gamma$ given in Assumptions A1-A2, such that with probability at least $1-e^{-u^2}$ we have

Figures (6)

Figure 1: SGD trajectories for online linear regression with different constant learning rates. The discrete blue spirals are the optimal model weights $\theta_{t}^* \in \mathbb{R}^{2}$, which start at $(1,0)$ and jump clockwise every $100$ steps. The orange paths are the learned weights $\theta_t$ for $0 \le t \le 17 \cdot 100$, starting at $\theta_0 = 0$. The orange squares mark the position every $100$ steps. We use batch size $B_t = 1$ and step sizes $\eta_{t} \in \{0.003, 0.01, 0.03, 0.1\}$ from left to right. The rightmost SGD trajectory is the most erratic, but it incurs the least regret because it adapts to changes in $\theta_{t}^*$ the fastest without diverging.
Figure 2: Learning rate schedules $\eta_t^*$ devised in Algorithm 1 for online linear regression. The batch size is $B_t = 100$ for all $1 \le t \le 200$, dimension $d = 100$, max step size ${\varepsilon} = 0.1$, and $\sigma = 2$.
Figure 3: The process $\tilde{v}_\tau$ defined by ODE \ref{eq:tv-n} if there is no distribution shift (left). Here we have ${\varepsilon} = 0.1$, ${\sf a}:={\varepsilon}(d+1)/B = 0.1$, ${\sf b}:= {\varepsilon} \sigma^2 d/B = 0.3$, and initialization $\tilde{v}_0 = 1$. Behavior of the learning rate schedule $\eta^*_t$ given by Algorithm 1, which asymptotically has the rate $1/t$ (right).
Figure 4: SGD trajectories of Algorithm 1 (top); and oscillating learning rates $\eta_t$ as we discretize the path defined by $\theta_{t}^*$ where $\eta_{\max} = 0.5$ (bottom).
Figure 5: Cumulative regret of Algorithm 1 with $\eta_{\max} = 1/\sqrt{d}$ for increasing dimensions $d$ (top-left); and the first and second coordinates of the SGD for $d=128$ and batch size $B_t = 256$ (top-right). Cumulative regret of Proposition \ref{propo:optimal-eta} for $d$-dimensional logistic regression (bottom-left); and the first and second coordinates of the SGD for $d=128$ and batch size $B_t = 256$ (bottom-right).
...and 1 more figures
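
The setup described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the rotation angle per jump, the spiral growth factor, the Gaussian feature distribution, the noise level `sigma`, and the helper names `spiral_targets` and `run_sgd` are all assumptions made here for the example. Only the start point $(1,0)$, the clockwise jump every $100$ steps, the horizon of $17 \cdot 100$ steps, batch size $1$, and the four constant step sizes come from the caption.

```python
import numpy as np


def spiral_targets(num_steps, jump_every=100, angle=np.pi / 8, growth=1.05):
    """Piecewise-constant optimal weights theta*_t in R^2.

    Starts at (1, 0) and jumps clockwise every `jump_every` steps.
    `angle` and `growth` are illustrative assumptions, not values from the paper.
    """
    targets = np.empty((num_steps, 2))
    for t in range(num_steps):
        k = t // jump_every           # number of jumps so far
        radius = growth ** k          # slowly growing radius (assumed)
        phi = -angle * k              # negative angle means clockwise rotation
        targets[t] = radius * np.array([np.cos(phi), np.sin(phi)])
    return targets


def run_sgd(targets, eta, sigma=0.1, seed=0):
    """Online SGD for linear regression with batch size 1 and constant step size.

    Features are standard Gaussian (assumed), so the excess risk of theta at
    step t is 0.5 * ||theta - theta*_t||^2; we sum it as a proxy for regret.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)               # theta_0 = 0, as in the caption
    regret = 0.0
    for theta_star in targets:
        regret += 0.5 * np.sum((theta - theta_star) ** 2)
        x = rng.standard_normal(2)
        y = x @ theta_star + sigma * rng.standard_normal()
        grad = (x @ theta - y) * x    # gradient of 0.5 * (x @ theta - y)^2
        theta = theta - eta * grad
    return theta, regret


if __name__ == "__main__":
    targets = spiral_targets(17 * 100)
    for eta in (0.003, 0.01, 0.03, 0.1):
        _, regret = run_sgd(targets, eta)
        print(f"eta={eta}: cumulative excess risk = {regret:.2f}")
```

Under these assumed settings, the larger step sizes should track the jumps in $\theta_t^*$ more quickly, which is the qualitative behavior the caption describes; the exact numbers depend on the assumed noise level and spiral geometry.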

Theorems & Definitions (18)

Definition 2.1: Distribution shift
Proposition 3.1
Theorem 3.2
Theorem 3.3
Remark 3.4
Remark 3.5
Lemma 3.6
Theorem 4.2
Proposition 4.3: Learning rate schedule
Remark 4.4
...and 8 more
