Fixed Design Analysis of Regularization-Based Continual Learning

Haoran Li, Jingfeng Wu, Vladimir Braverman

TL;DR

The risk bounds suggest that catastrophic forgetting can occur in CL with dissimilar tasks (under a precise similarity measurement) and that a well-tuned $\ell_2$-regularization can partially mitigate it by introducing intransigence.

Abstract

We consider a continual learning (CL) problem with two linear regression tasks in the fixed design setting, where the feature vectors are assumed fixed and the labels are assumed to be random variables. We consider an $\ell_2$-regularized CL algorithm, which computes an Ordinary Least Squares parameter to fit the first dataset, then computes another parameter that fits the second dataset under an $\ell_2$-regularization penalizing its deviation from the first parameter, and outputs the second parameter. For this algorithm, we provide tight bounds on the average risk over the two tasks. Our risk bounds reveal a provable trade-off between forgetting and intransigence of the $\ell_2$-regularized CL algorithm: with a large regularization parameter, the algorithm's output forgets less information about the first task but is more intransigent, extracting less new information from the second task; with a small regularization parameter, the reverse holds. Our results suggest that catastrophic forgetting could happen for CL with dissimilar tasks (under a precise similarity measurement) and that a well-tuned $\ell_2$-regularization can partially mitigate this issue by introducing intransigence.
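The two-step procedure described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `ols` and `l2_rcl`, the parameter name `mu`, and the $1/n$ scaling of the squared loss are assumptions made here for concreteness, and the paper's exact normalization may differ.

```python
import numpy as np


def ols(X, y):
    """Least-squares fit (minimum-norm solution if X is rank deficient)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def l2_rcl(X1, y1, X2, y2, mu):
    """Sketch of l2-regularized continual learning on two regression tasks.

    Step 1: w1 = OLS fit on the first dataset.
    Step 2: w2 = argmin_w (1/n) * ||X2 w - y2||^2 + mu * ||w - w1||^2,
            where n is the number of rows of X2 (assumed scaling).
    Returns (w1, w2); the algorithm's output is w2.
    """
    n, d = X2.shape
    w1 = ols(X1, y1)
    # Setting the gradient of the step-2 objective to zero gives
    # (X2^T X2 / n + mu I) w = X2^T y2 / n + mu w1.
    A = X2.T @ X2 / n + mu * np.eye(d)
    b = X2.T @ y2 / n + mu * w1
    w2 = np.linalg.solve(A, b)
    return w1, w2
```

Under these assumptions the two extremes of the trade-off are easy to see: as `mu` grows, `w2` is pulled toward `w1` (little forgetting, high intransigence), and as `mu` shrinks toward zero, `w2` approaches the least-squares fit on the second dataset alone.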

Paper Structure (42 sections, 5 theorems, 90 equations, 2 figures)

Key Result

Proposition 1

Suppose that Assumptions \ref{assump:fixed-design} to \ref{assump:commutable} hold. Then for the JL output \ref{eqn:joint-learning}, it holds that

Figures (2)

  • Figure 1: Expected excess risk vs. sample size for JL, OCL and $\ell_2$-RCL. For each point in each curve, the x-axis represents the sample size $n$ and the y-axis represents the expected CL excess risk (on a logarithmic scale). The problem instances $\mathtt{Q}(5,0)$, $\mathtt{Q}(15,0)$ and $\mathtt{Q}(15,15)$ are defined by \ref{eqn:setting-q}. The regularization parameter in the $\ell_2$-RCL algorithm is optimally tuned. The dimension of the task is $d=200$. The expectation of the excess risk is computed by taking an empirical average over $20$ independent runs.
  • Figure 2: The trade-off between forgetting and intransigence for $\ell_2$-RCL. For each point in each curve, the x-axis represents the regularization parameter and the y-axis represents the expected CL excess risk, the expected forgetting, or the expected intransigence. The problem instances $\mathtt{Q}(15,0)$ and $\mathtt{Q}(15,15)$ are defined by \ref{eqn:setting-q}. The dimension of the task is $d=200$. The sample size is fixed at $n=2{,}000$. The expectation is computed by taking an empirical average over $20$ independent runs.

Theorems & Definitions (18)

  • Remark 1: Memory size
  • Remark 2: OCL in the interpolation regime
  • Remark 3: Generalization to CL with $T$ tasks
  • Proposition 1: A risk bound for JL
  • Theorem 1: A risk bound for $\ell_2$-RCL/OCL
  • Corollary 1: Risk upper bounds for $\ell_2$-RCL
  • Example 1
  • Example 2
  • Example 3
  • Proof: Proof of Theorem \ref{thm:regularized-learning}
  • ...and 8 more