Fixed Design Analysis of Regularization-Based Continual Learning

Haoran Li; Jingfeng Wu; Vladimir Braverman

TL;DR

The analysis suggests that catastrophic forgetting can occur in continual learning (CL) with dissimilar tasks (under a precise similarity measure), and that a well-tuned $\ell_2$-regularization can partially mitigate it by introducing intransigence.

Abstract

We consider a continual learning (CL) problem with two linear regression tasks in the fixed design setting, where the feature vectors are assumed fixed and the labels are assumed to be random variables. We consider an $\ell_2$-regularized CL algorithm, which computes an Ordinary Least Squares parameter to fit the first dataset, then computes another parameter that fits the second dataset under an $\ell_2$-regularization penalizing its deviation from the first parameter, and outputs the second parameter. For this algorithm, we provide tight bounds on the average risk over the two tasks. Our risk bounds reveal a provable trade-off between forgetting and intransigence of the $\ell_2$-regularized CL algorithm: with a large regularization parameter, the algorithm output forgets less information about the first task but is intransigent to extract new information from the second task; and vice versa. Our results suggest that catastrophic forgetting could happen for CL with dissimilar tasks (under a precise similarity measurement) and that a well-tuned $\ell_2$-regularization can partially mitigate this issue by introducing intransigence.
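The two-step procedure described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the function names `ols` and `l2_rcl`, the symbol `mu` for the regularization parameter, and the $1/n$ normalization of the squared loss are all assumptions made here for concreteness.

```python
import numpy as np

def ols(X, y):
    """Least-squares fit of y on X (minimum-norm solution if X is rank-deficient)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def l2_rcl(X1, y1, X2, y2, mu):
    """Sketch of l2-regularized continual learning on two regression tasks.

    Step 1: fit the first dataset by ordinary least squares.
    Step 2: fit the second dataset while penalizing the squared l2
            distance to the step-1 parameter.
    Assumed objective for step 2 (normalization is a guess, mu > 0):
        (1/n2) * ||X2 w - y2||^2 + mu * ||w - w1||^2
    """
    w1 = ols(X1, y1)
    n2, d = X2.shape
    # Setting the gradient of the assumed objective to zero gives
    # (X2^T X2 / n2 + mu I) w = X2^T y2 / n2 + mu w1.
    A = X2.T @ X2 / n2 + mu * np.eye(d)
    b = X2.T @ y2 / n2 + mu * w1
    w2 = np.linalg.solve(A, b)
    return w1, w2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 50, 5
    X1, X2 = rng.standard_normal((n, d)), rng.standard_normal((n, d))
    y1, y2 = rng.standard_normal(n), rng.standard_normal(n)
    w1, w2 = l2_rcl(X1, y1, X2, y2, mu=1.0)
    print(np.linalg.norm(w2 - w1))
```

Under this formulation, a very large `mu` keeps the output close to the first-task parameter, while a very small `mu` approaches an ordinary least-squares fit of the second dataset alone. These are the two extremes of the forgetting-intransigence trade-off that the abstract describes.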

Paper Structure (42 sections, 5 theorems, 90 equations, 2 figures)

Introduction
Contributions.
Preliminaries
Two linear regression tasks.
Continual learning.
Ordinary continual learning.
$\ell_2$-Regularized continual learning.
A forgetting-intransigence decomposition.
Main Results
Assumptions
Notations.
Risk Bounds for Joint Learning
The effect of JL.
Risk Bounds for Continual Learning
The effect of OCL.
...and 27 more sections

Key Result

Proposition 1

Suppose that Assumptions \ref{assump:fixed-design} to \ref{assump:commutable} hold. Then for the JL output \eqref{eqn:joint-learning}, it holds that

Figures (2)

Figure 1: Expected excess risk vs. sample size for JL, OCL and $\ell_2$-RCL. For each point in each curve, the x-axis represents the sample size $n$ and the y-axis represents the expected CL excess risk (on a logarithmic scale). The problem instances $\mathtt{Q}(5,0)$, $\mathtt{Q}(15,0)$ and $\mathtt{Q}(15,15)$ are defined by \ref{eqn:setting-q}. The regularization parameter in the $\ell_2$-RCL algorithm is optimally tuned. The dimension of the task is $d=200$. The expectation of the excess risk is computed by taking an empirical average over $20$ independent runs.
Figure 2: The trade-off between forgetting and intransigence for $\ell_2$-RCL. For each point in each curve, the x-axis represents the sample size $n$ and the y-axis represents the expected CL excess risk or the expected forgetting or the expected intransigence. The problem instances $\mathtt{Q}(15,0)$ and $\mathtt{Q}(15,15)$ are defined by \ref{eqn:setting-q}. The dimension of the task is $d=200$. The sample size is $n=2{,}000$. The expectation is computed by taking an empirical average over $20$ independent runs.
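The empirical averaging over independent runs mentioned in the captions can be illustrated with a small fixed-design simulation. The sketch below is hypothetical throughout: it does not reproduce the $\mathtt{Q}(\cdot,\cdot)$ instances (their definition is not included here), it uses Gaussian features and a smaller dimension, and it reports per-task excess risks rather than the paper's forgetting and intransigence terms. The names `run_sweep` and `mus`, and the form of the risk, are assumptions.

```python
import numpy as np

def run_sweep(mus, n=200, d=20, sigma=1.0, runs=20, seed=0):
    """Average excess risks of the l2-regularized CL output over `runs` label draws.

    Fixed design: the features X1, X2 are drawn once and held fixed;
    only the label noise is resampled in each run.
    Returns an array of shape (len(mus), 3) with columns
    (average over both tasks, task-1 excess risk, task-2 excess risk).
    """
    rng = np.random.default_rng(seed)
    X1 = rng.standard_normal((n, d))                      # fixed features, task 1
    X2 = rng.standard_normal((n, d))                      # fixed features, task 2
    w1_star = rng.standard_normal(d)                      # hypothetical true parameters
    w2_star = w1_star + 0.5 * rng.standard_normal(d)      # a dissimilar second task
    out = np.zeros((len(mus), 3))
    for _ in range(runs):
        y1 = X1 @ w1_star + sigma * rng.standard_normal(n)
        y2 = X2 @ w2_star + sigma * rng.standard_normal(n)
        w1 = np.linalg.lstsq(X1, y1, rcond=None)[0]       # OLS on the first dataset
        for i, mu in enumerate(mus):
            # Regularized fit on the second dataset, pulled toward w1.
            A = X2.T @ X2 / n + mu * np.eye(d)
            b = X2.T @ y2 / n + mu * w1
            w = np.linalg.solve(A, b)
            r1 = np.mean((X1 @ (w - w1_star)) ** 2)       # excess risk on task 1
            r2 = np.mean((X2 @ (w - w2_star)) ** 2)       # excess risk on task 2
            out[i] += (0.5 * (r1 + r2), r1, r2)
    return out / runs

if __name__ == "__main__":
    mus = [1e-6, 1e-2, 1.0, 1e2, 1e6]
    for mu, row in zip(mus, run_sweep(mus)):
        print(f"mu={mu:g}  avg={row[0]:.3f}  task1={row[1]:.3f}  task2={row[2]:.3f}")
```

In this toy setting, a small `mu` should give a low task-2 risk and a high task-1 risk, and a large `mu` the reverse, which mirrors the qualitative trade-off the figure is meant to show.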

Theorems & Definitions (18)

Remark 1: Memory size
Remark 2: OCL in the interpolation regime
Remark 3: Generalization to CL with $T$ tasks
Proposition 1: A risk bound for JL
Theorem 1: A risk bound for $\ell_2$-RCL/OCL
Corollary 1: Risk upper bounds for $\ell_2$-RCL
Example 1
Example 2
Example 3
Proof of Theorem \ref{thm:regularized-learning}
...and 8 more
