Last Iterate Convergence of Incremental Methods and Applications in Continual Learning

Xufeng Cai; Jelena Diakonikolas

TL;DR

The paper establishes last-iterate (non-ergodic) convergence guarantees for incremental gradient and incremental proximal methods on finite-sum convex objectives, motivated by continual learning (CL). Its oracle complexity bounds match the best known average-iterate bounds up to a square-root-log or log factor, and they extend to randomly permuted update orders and to weighted averaging of the iterates with increasing weights. For incremental gradient descent (IGD) and the incremental proximal method (IPM) in the smooth convex setting, the last-iterate rates closely mirror the average-iterate rates; in the convex Lipschitz setting, IPM achieves an $\widetilde{\mathcal{O}}\big(\frac{G^2 T \|x_0 - x_*\|^2}{\epsilon^2}\big)$ bound, and inexact proximal evaluations are accommodated as well. The work also models CL via ridge-type regularization in IPM, highlighting both its potential to mitigate forgetting and its limitations: forgetting can be catastrophic under insufficient regularization, and the regularization must scale polynomially with problem parameters to achieve a target accuracy.

Abstract

Incremental gradient and incremental proximal methods are a fundamental class of optimization algorithms used for solving finite sum problems, broadly studied in the literature. Yet, without strong convexity, their convergence guarantees have primarily been established for the ergodic (average) iterate. Motivated by applications in continual learning, we obtain the first convergence guarantees for the last iterate of both incremental gradient and incremental proximal methods, in general convex smooth (for both) and convex Lipschitz (for the proximal variants) settings. Our oracle complexity bounds for the last iterate nearly match (i.e., match up to a square-root-log or a log factor) the best known oracle complexity bounds for the average iterate, for both classes of methods. We further obtain generalizations of our results to weighted averaging of the iterates with increasing weights and to randomly permuted orderings of updates. We study incremental proximal methods as a model of continual learning with generalization and argue that a large amount of regularization is crucial to preventing catastrophic forgetting. Our results generalize last iterate guarantees for incremental methods compared to the state of the art, as such results were previously known only for overparameterized linear models, which correspond to convex quadratic problems with infinitely many solutions.
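To make the distinction between the last iterate and the average iterate concrete, the following is a minimal sketch of incremental gradient descent that returns only its final point. It is not the paper's algorithm listing: the least-squares components, the toy data, the constant step size, and the names `least_squares_grad` and `igd_last_iterate` are all assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]
GradFn = Callable[[Sequence[float]], Vector]


def least_squares_grad(a: Sequence[float], b: float) -> GradFn:
    """Gradient of the illustrative component f(x) = 0.5 * (<a, x> - b)^2."""
    def grad(x: Sequence[float]) -> Vector:
        residual = sum(ai * xi for ai, xi in zip(a, x)) - b
        return [residual * ai for ai in a]
    return grad


def igd_last_iterate(
    grads: Sequence[GradFn],
    x0: Sequence[float],
    step_size: float,
    num_cycles: int,
) -> Vector:
    """Run cyclic incremental gradient steps and return the last iterate.

    Each cycle visits the T components once in a fixed order. No averaging
    of iterates is performed; the returned point is the final one.
    """
    x = list(x0)
    for _ in range(num_cycles):
        for grad in grads:
            g = grad(x)
            x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical toy data: three consistent equations whose solution is (1, 2).
components = [
    least_squares_grad([1.0, 0.0], 1.0),
    least_squares_grad([0.0, 1.0], 2.0),
    least_squares_grad([1.0, 1.0], 3.0),
]
x_last = igd_last_iterate(components, [0.0, 0.0], step_size=0.1, num_cycles=500)
print(x_last)
```

In this toy problem all components share a minimizer, so the last iterate approaches it directly. The step size here was picked by hand for the example and is not the step-size rule analyzed in the paper.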

Paper Structure (21 sections, 25 theorems, 112 equations, 1 figure, 2 algorithms)

This paper contains 21 sections, 25 theorems, 112 equations, 1 figure, 2 algorithms.

Introduction
Contributions
Last iterate convergence of Incremental Gradient Descent (IGD).
Last iterate convergence of Incremental Proximal Method (IPM).
IPM as a model of CL.
Further related work
Concurrent independent work.
Notation and preliminaries
Last Iterate Convergence of Incremental Gradient Descent
Shuffled SGD.
Incremental Proximal Method
Smooth convex setting
Regularization effect.
Convex Lipschitz setting
Inexact proximal point evaluations
...and 6 more sections

Key Result

Lemma 2.0

Under the paper's convexity and smoothness assumptions, for any $\bm{z} \in \mathbb{R}^d$ that is fixed in the $k$-th cycle of the IGD algorithm and any $\alpha, \beta > 0$ such that $\frac{1}{\alpha} + \frac{1}{\beta} \leq \frac{1}{2}$, if $\eta_k \leq \frac{1}{\sqrt{\beta}\,TL}$, then for all $k \in [K],$

Figures (1)

Figure 1: Numerical results of running IPM on $T$ component least-squares functions, corresponding to the $\ell_2$-regularized CL setting of $T$ linear regression tasks with cyclic replays.
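The setting in the caption can be sketched in a few lines. The code below is an illustrative assumption rather than the experiment behind the figure: each task is reduced to a single-row least-squares component $f_t(x) = \frac{1}{2}(\langle a_t, x\rangle - b_t)^2$, for which the proximal step $\arg\min_x f_t(x) + \frac{1}{2\eta}\|x - x_{\text{prev}}\|^2$ has the closed form used in `prox_least_squares`. The function names, the toy data, and the value of `eta` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Task = Tuple[Sequence[float], float]  # (a_t, b_t) for one single-row task


def prox_least_squares(
    x_prev: Sequence[float], a: Sequence[float], b: float, eta: float
) -> Vector:
    """Exact proximal step for f(x) = 0.5 * (<a, x> - b)^2.

    Solves argmin_x f(x) + (1 / (2 * eta)) * ||x - x_prev||^2 in closed form:
    x = x_prev - eta * residual / (1 + eta * ||a||^2) * a.
    """
    residual = sum(ai * xi for ai, xi in zip(a, x_prev)) - b
    scale = eta * residual / (1.0 + eta * sum(ai * ai for ai in a))
    return [xi - scale * ai for xi, ai in zip(x_prev, a)]


def ipm_last_iterate(
    tasks: Sequence[Task], x0: Sequence[float], eta: float, num_cycles: int
) -> Vector:
    """Cycle through the tasks with proximal steps and return the last iterate.

    A smaller eta weights the quadratic penalty more heavily, i.e. it keeps
    the new point closer to the previous one (stronger regularization).
    """
    x = list(x0)
    for _ in range(num_cycles):
        for a, b in tasks:
            x = prox_least_squares(x, a, b, eta)
    return x


def average_loss(tasks: Sequence[Task], x: Sequence[float]) -> float:
    """Mean of the component losses 0.5 * (<a_t, x> - b_t)^2 over all tasks."""
    total = 0.0
    for a, b in tasks:
        residual = sum(ai * xi for ai, xi in zip(a, x)) - b
        total += 0.5 * residual * residual
    return total / len(tasks)


# Hypothetical toy tasks: three consistent equations with solution (1, 2).
toy_tasks = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 3.0)]
x_ipm = ipm_last_iterate(toy_tasks, [0.0, 0.0], eta=0.5, num_cycles=100)
print(x_ipm, average_loss(toy_tasks, x_ipm))
```

Evaluating `average_loss` at the last iterate, rather than at an averaged point, mirrors the quantity of interest in continual learning, where only the final model is deployed.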

Theorems & Definitions (40)

Lemma 2.0
Lemma 2.0
Theorem 2.1
Corollary 2.1: Increasing Weighted Averaging
Corollary 2.1: Shuffled SGD (RR/SO)
Theorem 3.1
Theorem 3.2
Proof sketch
Theorem 3.3
Corollary 3.3: Convex Smooth
...and 30 more
