On the Convergence of Continual Learning with Adaptive Methods

Seungyub Han; Yeongmo Kim; Taehyun Cho; Jungwoo Lee

TL;DR

The paper addresses the convergence of memory-based continual learning by formulating it as a smooth nonconvex finite-sum problem and showing that training current tasks can cause forgetting of past tasks through a forgetting term $\Gamma_t$. It introduces NCCL, an adaptive method that jointly tunes step sizes across previous and current tasks to minimize forgetting while preserving convergence, and provides a theoretical bound indicating SGD-like rates under decaying learning rates when the forgetting term is controlled. A reformulated objective to minimize the cumulative forgetting term under a constraint on step sizes guides the adaptive design. Empirically, NCCL with memory schemes outperforms baselines on standard image-classification continual learning benchmarks, demonstrating significantly reduced forgetting and competitive convergence performance, with the memory bias term $B_t$ contributing minimally to degradation.

Abstract

One of the objectives of continual learning is to prevent catastrophic forgetting when learning multiple tasks sequentially, and existing solutions have been driven by the conceptualization of the plasticity-stability dilemma. However, the convergence of continual learning for each sequential task has been less studied so far. In this paper, we provide a convergence analysis of memory-based continual learning with stochastic gradient descent, together with empirical evidence that training current tasks causes the cumulative degradation of previous tasks. We propose an adaptive method for nonconvex continual learning (NCCL), which adjusts the step sizes of both previous and current tasks using their gradients. The proposed method can achieve the same convergence rate as the SGD method when the catastrophic forgetting term, which we define in the paper, is suppressed at each iteration. Further, we demonstrate that the proposed algorithm improves the performance of continual learning over existing methods on several image classification tasks.
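To make the idea of separate step sizes concrete, the following is a minimal sketch of one memory-based update, not the authors' NCCL rule. It assumes the update has the form $x^{t+1} = x^t - \alpha \nabla f(x^t) - \beta \nabla g(x^t)$, where $\nabla f$ is a gradient estimated on stored samples of previous tasks and $\nabla g$ is a gradient on the current task. The function names and the fixed values of `alpha` and `beta` are hypothetical; the rule by which NCCL actually chooses the two step sizes is not reproduced here.

```python
from typing import List, Sequence

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def two_step_size_update(
    x: Sequence[float],
    grad_f: Sequence[float],
    grad_g: Sequence[float],
    alpha: float,
    beta: float,
) -> Vector:
    """One SGD-style step with separate step sizes (assumed form).

    alpha scales the memory (previous-task) gradient grad_f,
    beta scales the current-task gradient grad_g.
    """
    return [xi - alpha * fi - beta * gi for xi, fi, gi in zip(x, grad_f, grad_g)]


def gradients_conflict(grad_f: Sequence[float], grad_g: Sequence[float]) -> bool:
    """True when the two gradients have a negative inner product."""
    return dot(grad_f, grad_g) < 0.0


# Hypothetical toy values, chosen only for illustration.
x = [1.0, 2.0]
grad_f = [1.0, 0.0]   # gradient on memory samples of previous tasks
grad_g = [-1.0, 1.0]  # gradient on the current task
x_next = two_step_size_update(x, grad_f, grad_g, alpha=0.5, beta=0.25)
conflict = gradients_conflict(grad_f, grad_g)
```

An adaptive method in this setting would pick `alpha` and `beta` at each iteration from the two gradients, for example depending on whether `gradients_conflict` returns true, rather than keeping them fixed as in this toy example.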

Paper Structure (28 sections, 9 theorems, 81 equations, 2 figures, 13 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Continual Learning as Nonconvex Optimization
Memory-based Nonconvex Continual Learning
Theoretical Convergence Analysis
Reformulated Problem of Continual Learning
Adaptive Methods for Continual Learning
A-GEM
NCCL
Experiments
Experimental setup
Experiment Results
Conclusion
Additional Backgrounds and Extended Discussion
...and 13 more sections

Key Result

Lemma 4.1

If $M_0$ is uniformly sampled from $P$, then both episodic memory and ER-reservoir satisfy

Figures (2)

Figure 1: Geometric illustration of Non-Convex Continual Learning (NCCL). In continual learning, a model parameter $x^t$ starts from a locally optimal point for the previously learned tasks, $x_{P}^*$. Over $T$ iterations, we expect to reach a new optimal point $x_{P\cup C}^*$ that performs well on both $P$ and $C$. In the $t$-th iteration, $x^t$ encounters either $\nabla g_{J_{t},pos}(x^t)$ or $\nabla g_{J_{t},neg}(x^t)$. These two cases indicate whether $\langle \nabla f_{I_t}(x^t), \nabla g_{J_{t}}(x^t) \rangle$ is positive or not. To prevent $x^t$ from escaping the feasible region, i.e., catastrophic forgetting, we impose a theoretical condition on the learning rates for $f$ and $g$.
Figure 2: Metrics for continual learning (CL) algorithms trained on split CIFAR-100 with 5 different seeds. (a) Forgetting versus $\sum \mathbb{E}[\Gamma_t]$ at the end of training. (b) Evolution of $\sum \mathbb{E}[\Gamma_t]$ during continual learning. (c) Empirical verification of the relation between $\lVert \nabla f(x) \rVert$ for the first task and the test loss of the first task in split CIFAR-100. (d)-(e) Empirical verification of $\sum \mathbb{E}[\Gamma_t]$ versus $\lVert \nabla f(x) \rVert$ for the first task in CL algorithms. The red horizontal line indicates the empirical $\lVert \nabla f(x) \rVert$ right after training the first task. (f) Illustration of the empirical $B_t$ at the end of each task.
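The sign test on the gradient inner product described in the Figure 1 caption is the same test used by A-GEM, which appears in the paper outline as a baseline adaptive method. The sketch below shows the standard A-GEM projection, not NCCL: when the current-task gradient conflicts with the memory gradient, the conflicting component is removed. The function and variable names are hypothetical, and plain Python lists stand in for model gradients.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def agem_project(grad_cur: Sequence[float], grad_mem: Sequence[float]) -> List[float]:
    """A-GEM-style projection of the current-task gradient.

    If <grad_cur, grad_mem> is non-negative, grad_cur is returned unchanged.
    Otherwise the component of grad_cur along grad_mem is subtracted, so the
    result has zero inner product with grad_mem.
    """
    inner = dot(grad_cur, grad_mem)
    if inner >= 0.0:
        # No conflict (this also covers a zero memory gradient).
        return list(grad_cur)
    scale = inner / dot(grad_mem, grad_mem)
    return [c - scale * m for c, m in zip(grad_cur, grad_mem)]


# Hypothetical toy gradients: the first pair conflicts, the second does not.
projected = agem_project([1.0, -1.0], [0.0, 1.0])
unchanged = agem_project([1.0, 1.0], [0.0, 1.0])
```

In the conflicting case the projected gradient no longer increases the memory loss to first order, which is the geometric intuition behind keeping $x^t$ inside the region where previous tasks are not forgotten.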

Theorems & Definitions (20)

Lemma 4.1
Lemma 4.2
Theorem 4.3
Lemma 4.4
Lemma 4.5
Theorem 4.6
Corollary 4.7
Remark 4.8
Proof: Derivation of Equation \ref{eq:changelsmooth}
Lemma C.1
...and 10 more
