Doubly Perturbed Task Free Continual Learning

Byung Hyun Lee; Min-hwan Oh; Se Young Chun

TL;DR

This work tackles Task Free Continual Learning (TF-CL), where a model learns sequentially without explicit task labels and faces forgetting of past tasks. It introduces Doubly Perturbed Continual Learning (DPCL), which derives an upper bound on the TF-CL objective under adversarial input perturbations and weight perturbations, and implements it via Perturbed Function Interpolation (PFI) for inputs and Branched Stochastic Classifiers (BSC) for weights, complemented by perturbation-induced memory management (PIMA) and an adaptive learning-rate strategy. The approach yields a practical surrogate objective that stabilizes learning across tasks and improves generalization by flattening both input and weight loss landscapes. Empirically, DPCL outperforms strong rehearsal- and perturbation-based baselines on CIFAR100, CIFAR100-SC, and ImageNet-100 across disjoint and blurred task configurations, showing consistent gains and robustness to memory constraints and task boundaries.

Abstract

Task Free online continual learning (TF-CL) is a challenging problem where the model incrementally learns tasks without explicit task information. Although training on all data from the past, present, and future is considered the gold standard, naive TF-CL approaches that train only on current samples may conflict with learning from future samples, leading to catastrophic forgetting and poor plasticity. Thus, proactive consideration of unseen future samples becomes imperative in TF-CL. Motivated by this intuition, we propose a novel TF-CL framework that considers future samples and show that injecting adversarial perturbations into both the input data and the decision-making process is effective. We then propose a novel method named Doubly Perturbed Continual Learning (DPCL) to implement these input and decision-making perturbations efficiently. Specifically, for input perturbation, we propose an approximate perturbation method that injects noise into the input data as well as the feature vector and then interpolates the two perturbed samples. For perturbation of the decision-making process, we devise multiple stochastic classifiers. We also investigate a memory management scheme and learning rate scheduling that reflect the proposed double perturbations. We demonstrate that the proposed method outperforms state-of-the-art baseline methods by large margins on various TF-CL benchmarks.
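The input perturbation described in the abstract (noise injected into the input and the feature vector, followed by interpolation) can be illustrated with a minimal sketch. It is not the authors' implementation. It reads "the two perturbed samples" as two distinct training samples mixed in a mixup-like way, and that reading, the Gaussian noise, the Beta-distributed mixing coefficient, the toy linear encoder, and every function and parameter name below are assumptions made for illustration only.

```python
import random


def encoder(x, w):
    """Toy linear encoder with ReLU; a hypothetical stand-in for a real feature extractor."""
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w]


def add_noise(v, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to every entry of v."""
    return [vi + rng.gauss(0.0, sigma) for vi in v]


def perturbed_interpolation(x_a, y_a, x_b, y_b, w, sigma_in, sigma_feat, alpha, rng):
    """Perturb two samples at the input and feature level, then interpolate them.

    Assumed reading of the abstract, not the paper's exact procedure.
    """
    # Step 1: inject noise into both inputs.
    xa = add_noise(x_a, sigma_in, rng)
    xb = add_noise(x_b, sigma_in, rng)
    # Step 2: encode, then inject noise into both feature vectors.
    fa = add_noise(encoder(xa, w), sigma_feat, rng)
    fb = add_noise(encoder(xb, w), sigma_feat, rng)
    # Step 3: interpolate features and one-hot labels with the same coefficient.
    lam = rng.betavariate(alpha, alpha)
    feat = [lam * a + (1.0 - lam) * b for a, b in zip(fa, fb)]
    label = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return feat, label, lam


if __name__ == "__main__":
    rng = random.Random(0)
    w = [[1.0, 0.0], [0.0, 1.0]]  # identity encoder weights for the demo
    feat, label, lam = perturbed_interpolation(
        [1.0, 2.0], [1.0, 0.0], [3.0, 4.0], [0.0, 1.0],
        w, sigma_in=0.1, sigma_feat=0.1, alpha=1.0, rng=rng,
    )
    print(feat, label, lam)
```

Sharing one coefficient between features and labels keeps the soft label consistent with the mixed feature. Setting both noise levels to zero reduces the sketch to plain feature-level interpolation.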

Paper Structure (29 sections, 4 theorems, 25 equations, 5 figures, 13 tables, 3 algorithms)

Introduction
Related Works
Continual Learning (CL)
Task Free Continual Learning (TF-CL)
Input and Weight Perturbations
Problem Formulation
Revisiting Conventional TF-CL
Novel TF-CL Considering a Future Sample
Doubly Perturbed Task Free Continual Learning
Efficient Optimization for Doubly Perturbed Task Free Continual Learning
Perturbed Function Interpolation
Branched Stochastic Classifiers
Perturbation-Induced Memory Management and Adaptive Learning Rate
Experiments
Experimental Setups
...and 14 more sections

Key Result

Proposition 1

Assume that $\mathcal{L}_t(\theta)$ is Lipschitz continuous for all $t$ and that $\phi'$ is obtained from $\phi^t$ by a finite number of gradient steps, so that $\phi'$ is a bounded random variable and $\eta_{2}^t < \infty$ with high probability. Then the loss (objective_proposed) admits an upper bound expressed in terms of $\mathcal{L}_{t,\Delta}(\theta) = \ell (h(x^t+\Delta x; \theta), y^t)$.

Figures (5)

Figure 1: (Left) Input loss landscape of TF-CL when the weight $\theta^t$ has been determined for sample $x^t$. We want $\ell(h(x;\theta^t),y)$ to be flat around $x^t$ so that the loss for $x^{\tau}, \tau\in[1,\cdots,t-1, t+1]$ does not deviate significantly from the loss at $x^t$. (Right) Weight loss landscape of TF-CL, where $\phi$ shifts from $\phi^{t}$ to $\phi^{t+1}$ when training on the new sample $x^{t+1}$. We want $\ell(h(x^t;[\theta_{e};\phi]),y^t)$ to be flat around $\phi^t$ so that the loss for $x^t$ does not increase dramatically when $\phi$ shifts from $\phi^t$ to $\phi^{t+1}$.
Figure 2: Illustration of Perturbed Function Interpolation (PFI) and Branched Stochastic Classifiers (BSC). PFI randomly perturbs the input, which smooths the input loss landscape. For weight perturbation, BSC averages weights along the training trajectory, introduces multiple classifiers, and conducts variational inference at test time.
Figure 3: Any-time inference results on CIFAR100, CIFAR100-SC, and ImageNet-100. Each point represents the average accuracy over 5 random seeds, and the shaded area represents the standard deviation ($\pm$) around the average accuracy.
Figure 4: t-SNE of the features at the end of the encoder on CIFAR100. We computed the features and losses for samples from the first task after training on the fifth (last) task. The color represents the loss of a sample (yellow for high loss and purple for low loss). DPCL has low loss overall in all regions, especially near the class boundaries.
Figure S1: Weight loss landscape for data from the first task after training on (a) the first, (b) the third, and (c) the fifth task of the CIFAR100 dataset. We randomly selected the perturbation direction and averaged the results over 5 random seeds. DPCL has the flattest weight loss landscape in all cases and the lowest loss values at the origin.
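The weight-side idea in the Figure 2 caption (averaging weights along the training trajectory over several classifier branches) can be illustrated with a small sketch. It is not the paper's BSC procedure: the running-mean update, the linear branches, and the test-time averaging of branch probabilities (used here in place of the variational inference the caption mentions) are all assumptions, as are the class and function names.

```python
import math


class AveragedBranch:
    """One classifier branch that tracks the running mean of its weights (hypothetical helper)."""

    def __init__(self, weights):
        self.avg = [row[:] for row in weights]  # running mean over the trajectory
        self.n = 1                              # number of weight snapshots seen

    def update(self, new_weights):
        """Fold a new weight snapshot into the running mean."""
        self.n += 1
        for i, row in enumerate(new_weights):
            for j, w in enumerate(row):
                self.avg[i][j] += (w - self.avg[i][j]) / self.n


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def ensemble_predict(branches, feat):
    """Average class probabilities over all branches, each using its averaged weights."""
    probs = []
    for b in branches:
        logits = [sum(w * f for w, f in zip(row, feat)) for row in b.avg]
        probs.append(softmax(logits))
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]


if __name__ == "__main__":
    b1 = AveragedBranch([[1.0, 0.0], [0.0, 1.0]])
    b2 = AveragedBranch([[0.5, 0.5], [0.5, 0.5]])
    b1.update([[3.0, 0.0], [0.0, 1.0]])  # running mean of b1 becomes [[2, 0], [0, 1]]
    print(ensemble_predict([b1, b2], [1.0, 0.0]))
```

Averaging snapshots tends to place each branch in a flatter region of the weight loss landscape than any single snapshot, and combining branches further smooths the prediction. Both effects are the intuition behind the sketch rather than claims verified here.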

Theorems & Definitions (7)

Proposition 1
Proposition 2
Proposition 1
proof
Remark
Proposition 2
proof
