Optimal Rates in Continual Linear Regression via Increasing Regularization

Ran Levinstein, Amit Attia, Matan Schliserman, Uri Sherman, Tomer Koren, Daniel Soudry, Itay Evron

TL;DR

This work studies realizable continual linear regression under random task orderings, addressing the gap between the known $\Omega(1/k)$ lower bound and prior unregularized upper bounds $O(1/k^{1/4})$. By reducing both explicit isotropic regularization and implicit finite-step regularization to Incremental Gradient Descent on carefully constructed surrogate losses, the authors perform a unified last-iterate SGD analysis. They show that a fixed regularization strength yields a near-optimal $O(\log k / k)$ rate, while an increasing regularization schedule attains the optimal $O(1/k)$ rate for the last iterate, and extend these results to the seen-task loss. The findings provide principled guidance on regularization scheduling to mitigate forgetting in worst-case continual sequences and connect practical regimens to tight theoretical guarantees with potential implications for broader continual learning settings.
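The implicit scheme mentioned above (a finite step budget per task) can be sketched as follows. This is a minimal illustration rather than the authors' algorithm: the helper names (`make_tasks`, `budgeted_gd`, `continual_finite_steps`), the step size $1/R^2$, the loss normalization, and the decreasing budget in the demo are all assumptions made for the example.

```python
import numpy as np

def make_tasks(num_tasks, n, d, rng):
    """Realizable tasks: every task shares one solution w_star (y_m = X_m w_star)."""
    w_star = rng.standard_normal(d)
    tasks = []
    for _ in range(num_tasks):
        X = rng.standard_normal((n, d))
        tasks.append((X, X @ w_star))
    return tasks, w_star

def average_loss(w, tasks):
    """Mean over tasks of 0.5 * ||X w - y||^2 (normalization is an assumption)."""
    return float(np.mean([0.5 * np.sum((X @ w - y) ** 2) for X, y in tasks]))

def budgeted_gd(w, X, y, steps, lr):
    """Run a fixed number of gradient steps on 0.5 * ||X w - y||^2."""
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y)
    return w

def continual_finite_steps(tasks, k, steps_per_task, rng):
    """Visit k tasks drawn uniformly with replacement, each with a step budget."""
    d = tasks[0][0].shape[1]
    # Data radius: largest spectral norm over tasks; lr = 1/R^2 keeps GD stable.
    R = max(np.linalg.norm(X, 2) for X, _ in tasks)
    lr = 1.0 / R**2
    w = np.zeros(d)
    for t in range(1, k + 1):
        X, y = tasks[rng.integers(len(tasks))]
        w = budgeted_gd(w, X, y, steps_per_task(t), lr)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tasks, w_star = make_tasks(num_tasks=5, n=3, d=10, rng=rng)
    # Hypothetical decreasing budget: fewer steps per task as t grows.
    w = continual_finite_steps(tasks, 50, lambda t: max(1, 20 // t), rng)
    print("average loss:", average_loss(w, tasks))
```

A smaller budget keeps the iterate closer to where the previous task left it, which is the sense in which fewer steps act like stronger regularization.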

Abstract

We study realizable continual linear regression under random task orderings, a common setting for developing continual learning theory. In this setup, the worst-case expected loss after $k$ learning iterations admits a lower bound of $\Omega(1/k)$. However, prior work using an unregularized scheme has only established an upper bound of $O(1/k^{1/4})$, leaving a significant gap. Our paper proves that this gap can be narrowed, or even closed, using two frequently used regularization schemes: (1) explicit isotropic $\ell_2$ regularization, and (2) implicit regularization via finite step budgets. We show that these approaches, which are used in practice to mitigate forgetting, reduce to stochastic gradient descent (SGD) on carefully defined surrogate losses. Through this lens, we identify a fixed regularization strength that yields a near-optimal rate of $O(\log k / k)$. Moreover, formalizing and analyzing a generalized variant of SGD for time-varying functions, we derive an increasing regularization strength schedule that provably achieves an optimal rate of $O(1/k)$. This suggests that schedules that increase the regularization coefficient or decrease the number of steps per task are beneficial, at least in the worst case.
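To make the explicit scheme concrete, the sketch below assumes each task is learned by minimizing $\frac{1}{2}\Vert \mathbf{X}\mathbf{w} - \mathbf{y}\Vert^2 + \frac{\lambda_t}{2}\Vert \mathbf{w} - \mathbf{w}_{t-1}\Vert^2$, which has a closed-form solution. The function names and both schedules (`fixed`, `increasing`) are hypothetical placeholders chosen for illustration; they are not the schedules derived in the paper.

```python
import numpy as np

def regularized_step(w_prev, X, y, lam):
    """Closed-form argmin_w 0.5*||X w - y||^2 + 0.5*lam*||w - w_prev||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * w_prev)

def run_continual(tasks, k, schedule, seed=0):
    """Learn k tasks drawn uniformly with replacement; schedule(t) gives lambda_t."""
    rng = np.random.default_rng(seed)
    w = np.zeros(tasks[0][0].shape[1])
    for t in range(1, k + 1):
        X, y = tasks[rng.integers(len(tasks))]
        w = regularized_step(w, X, y, schedule(t))
    return w

def average_loss(w, tasks):
    """Mean over tasks of 0.5 * ||X w - y||^2 (normalization is an assumption)."""
    return float(np.mean([0.5 * np.sum((X @ w - y) ** 2) for X, y in tasks]))

# Hypothetical schedules, for illustration only.
fixed = lambda t: 1.0
increasing = lambda t: 0.1 * t

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w_star = rng.standard_normal(8)
    tasks = []
    for _ in range(5):
        X = rng.standard_normal((2, 8))
        tasks.append((X, X @ w_star))  # realizable: all tasks share w_star
    for name, sched in [("fixed", fixed), ("increasing", increasing)]:
        w = run_continual(tasks, k=100, schedule=sched)
        print(name, average_loss(w, tasks))
```

In this form a larger $\lambda_t$ keeps the new iterate closer to the previous one, so an increasing schedule takes progressively smaller steps, much like a decaying step size in SGD.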

Paper Structure

This paper contains 34 sections, 13 theorems, 125 equations, 2 figures, 1 table, 3 algorithms.

Key Result

Lemma 3.1

For $t \in [k]$, define $f^{(t)}_{r}$ and $f^{(t)}_{b}$ as in the reductions labeled reduc:regularized_to_sgd and reduc:early_to_sgd, and recall the data radius $R \triangleq \max_{m \in [M]} \Vert \mathbf{X}_m \Vert_2$.
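As a small illustration of the data radius, the sketch below computes $R$ as the largest spectral norm among the task matrices. The function name `data_radius` is hypothetical. Reading $R^2$ as a smoothness bound is an inference from the fact that the Hessian of $\frac{1}{2}\Vert \mathbf{X}_m \mathbf{w} - \mathbf{y}_m\Vert^2$ is $\mathbf{X}_m^\top \mathbf{X}_m$; it is not taken from the lemma's statement, which is truncated here.

```python
import numpy as np

def data_radius(task_matrices):
    """R = max_m ||X_m||_2, where ||.||_2 is the spectral norm (largest singular value)."""
    return max(float(np.linalg.norm(X, ord=2)) for X in task_matrices)

if __name__ == "__main__":
    Xs = [np.diag([3.0, 1.0]), np.array([[0.0, 4.0], [0.0, 0.0]])]
    R = data_radius(Xs)
    # Every per-task least-squares loss 0.5*||X w - y||^2 is R^2-smooth.
    print("R =", R, "smoothness bound R^2 =", R**2)
```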

Figures (2)

  • Figure 1: Schematic overview of our contributions compared to prior results. Evron et al. (2025) reduce unregularized continual linear regression to incremental gradient descent on a surrogate objective with fixed smoothness. They then analyze the last iterate of SGD to derive a loss rate of $\mathcal{O}({1}/{k^{1/4}})$ under random task orderings. In contrast, we show that adding explicit or implicit regularization enables tuning the smoothness of the corresponding surrogate objective. Importantly, this added flexibility allows a more nuanced last-iterate analysis: a well-tuned fixed regularization strength yields a near-optimal $\mathcal{O}(\log k / k)$ rate, while a specific increasing schedule achieves the first $\mathcal{O}(1/k)$ rate for continual linear regression under random orderings.
  • Figure 2: Optimal fixed regularization grows with horizon and task angle. Each curve shows $\lambda^\star(k;\theta)$ obtained by minimizing the expected loss after $k$ steps of the explicit-regularization scheme with a constant $\lambda$. We observe an approximately linear growth in $k$ and higher optimal regularization for larger $\theta$.
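A quantity like $\lambda^\star(k;\theta)$ can be explored numerically with the sketch below. The setup is an assumption, since the figure's exact configuration is not given here: two unit-norm rank-one tasks in $\mathbb{R}^2$ separated by angle $\theta$, uniform with-replacement ordering, a hypothetical initial error `e0`, and a grid search over $\lambda$. The expectation is computed exactly by propagating the second moment of the error $\mathbf{w}_t - \mathbf{w}^\star$.

```python
import numpy as np

def expected_loss(lam, k, theta, e0=None):
    """Exact expected average loss after k regularized steps on two rank-one tasks."""
    xs = [np.array([1.0, 0.0]), np.array([np.cos(theta), np.sin(theta)])]
    # For a unit-norm task x, one regularized step maps the error e = w - w_star
    # to lam * (x x^T + lam I)^{-1} e = (I - x x^T / (1 + lam)) e.
    As = [np.eye(2) - np.outer(x, x) / (1.0 + lam) for x in xs]
    # Assumed initial error direction (hypothetical choice).
    e0 = np.array([1.0, 1.0]) / np.sqrt(2.0) if e0 is None else e0
    S = np.outer(e0, e0)  # second moment E[e e^T]
    for _ in range(k):
        S = sum(A @ S @ A for A in As) / len(As)
    # Average over tasks of 0.5 * E[(x^T e)^2].
    return float(np.mean([0.5 * x @ S @ x for x in xs]))

def best_fixed_lambda(k, theta, grid):
    """Grid point minimizing the expected loss after k steps."""
    losses = [expected_loss(lam, k, theta) for lam in grid]
    return float(grid[int(np.argmin(losses))])

if __name__ == "__main__":
    grid = np.logspace(-3, 3, 200)
    for k in (10, 20, 40):
        print(k, best_fixed_lambda(k, theta=np.pi / 8, grid=grid))
```

Because the grid and initial error are arbitrary choices, the printed values only indicate trends under this toy setup and should not be compared directly against the figure.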

Theorems & Definitions (32)

  • Definition 2.1: Average loss
  • Remark 2.2: Forgetting and seen-task loss
  • Remark 2.3: Unregularized first task
  • Lemma 3.1: Properties of the IGD objectives
  • Definition 4.2: Random task ordering
  • Lemma 4.3: Rates for fixed regularization strength
  • Corollary 4.4: Near-optimal rates via fixed regularization strength
  • Remark 4.5: Extension to without-replacement orderings
  • Theorem 4.6: Optimal rates for increasing regularization
  • Lemma 4.7: SGD bound for time-varying distributions
  • ...and 22 more