Fine-Grained Gradient Restriction: A Simple Approach for Mitigating Catastrophic Forgetting

Bo Liu, Mao Ye, Peter Stone, Qiang Liu

TL;DR

This work analyzes an often overlooked hyper-parameter in GEM, the memory strength, which boosts the empirical performance by further constraining the update direction, and proposes two approaches that more flexibly constrain the update direction.

Abstract

A fundamental challenge in continual learning is to balance the trade-off between learning new tasks and remembering previously acquired knowledge. Gradient Episodic Memory (GEM) achieves this balance by using a subset of past training samples to restrict the update direction of the model parameters. In this work, we start by analyzing an often overlooked hyper-parameter in GEM, the memory strength, which boosts empirical performance by further constraining the update direction. We show that memory strength is effective mainly because it improves GEM's generalization ability and therefore leads to a more favorable trade-off. Motivated by this finding, we propose two approaches that constrain the update direction more flexibly. Our methods achieve uniformly better Pareto frontiers of remembering old and learning new knowledge than memory strength does. We further propose a computationally efficient method to approximately solve the optimization problem with more constraints.
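
To make the role of memory strength concrete, the following is a minimal sketch of a GEM-style projection for the special case of a single past task, where the constrained problem has a closed-form solution. It is an illustration under stated assumptions, not the authors' implementation: the function name `gem_project`, the parameter name `memory_strength`, and the choice to add the memory strength to the dual variable (as described in the original GEM paper) are assumptions that do not come from this page.

```python
import numpy as np


def gem_project(g_new, g_mem, memory_strength=0.0):
    """Single-constraint GEM-style projection (illustrative sketch).

    Solves  min_z 0.5 * ||z - g_new||^2  s.t.  <z, g_mem> >= 0
    in closed form, then shifts the dual variable by `memory_strength`.
    The names and the single-task setting are assumptions for illustration.
    """
    denom = np.dot(g_mem, g_mem)
    if denom == 0.0:
        # No information from the memory gradient: keep the new-task gradient.
        return g_new.copy()
    # Optimal dual variable: zero when the constraint is already satisfied.
    v = max(0.0, -np.dot(g_new, g_mem) / denom)
    # A positive memory strength pushes z further into the half-space of g_mem.
    return g_new + (v + memory_strength) * g_mem


g_new = np.array([1.0, -1.0])  # gradient on the current task
g_mem = np.array([0.0, 1.0])   # gradient on the episodic memory

z_plain = gem_project(g_new, g_mem)
z_strong = gem_project(g_new, g_mem, memory_strength=0.5)
print(np.dot(z_plain, g_mem), np.dot(z_strong, g_mem))
```

With zero memory strength the projected direction lies exactly on the boundary of the half-space; a positive value moves it strictly inside, which is the sense in which memory strength further constrains the update direction.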

Paper Structure (35 sections, 3 theorems, 35 equations, 5 figures, 2 tables)

Key Result

Proposition 1

Given some search space $\mathcal{Z}$ of $\boldsymbol{z}$, suppose $\boldsymbol{z}^{*}$ is the solution of the following problem: [...]. Given any $\delta>0$, with probability at least $1-\delta$, we have [...], where $\Delta$ is the generalization gap and $\mathfrak{R}_{\left|\hat{\mathcal{D}}_{s}\right|}\left[\mathcal{Z}\right]$ denotes the Rademacher complexity of the set $\mathcal{Z}$.

Figures (5)

  • Figure 1: When progressing from task 1 to task 2, GEM computes $\tilde{g}_1$, the gradient on the episodic memory stored for task $1$. Then GEM ensures that the update direction is within the half-space defined by $\tilde{g}_1$. We propose to decompose $\tilde{g}_1$ into multiple vectors, i.e. $\tilde{g}_1^1, \tilde{g}_1^2$. The search space then becomes more constrained based on how we divide $\tilde{g}_1$.
  • Figure 2: Trade-off between $\langle \boldsymbol{g}_s, \boldsymbol{z} \rangle$ (x-axis) and $\langle \boldsymbol{g}_t, \boldsymbol{z} \rangle$ (y-axis), where $\boldsymbol{z}$ is the update direction found by mGEM/GEM. Here, d/p-mGEM($n$) denotes that there are $n$ modules in total. For example, p-mGEM(2) means we divide the model parameters into 2 groups and apply GEM to each group separately.
  • Figure 3: GEM and mGEM's Pareto frontier of FWD and BWD on Split CIFAR100. The dashed gray lines are linear functions $y = -x + c$, where $c \in \mathbb{R}$, so points on the same dashed line have the same ACC. Numbers in parentheses denote how many modules mGEM uses.
  • Figure 4: Performance of mGEM versus GEM on four commonly used neural architectures. The underlying continual learning task is the 10-way Split CIFAR100 benchmark. Here we report 3 runs with 3 different seeds for each architecture.
  • Figure 5: Sampled images from the Digit-Five (left) and DomainNet (right) datasets.
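
The parameter-wise variant described in the Figure 2 caption (dividing the model parameters into groups and applying GEM to each group separately) can be sketched as follows. This is a hedged illustration for a single past task, not the authors' code: the helper names `project_one` and `p_mgem_direction`, and the choice to form groups as contiguous chunks of a flattened gradient, are assumptions made only for this example.

```python
import numpy as np


def project_one(g_new, g_mem):
    """Closed-form GEM-style projection for one constraint (sketch)."""
    denom = np.dot(g_mem, g_mem)
    if denom == 0.0:
        return g_new.copy()
    v = max(0.0, -np.dot(g_new, g_mem) / denom)
    return g_new + v * g_mem


def p_mgem_direction(g_new, g_mem, n_modules):
    """Apply the projection separately to each parameter group.

    Groups are contiguous chunks of the flat gradient here; how the
    parameters are actually grouped is an assumption of this sketch.
    """
    new_parts = np.array_split(g_new, n_modules)
    mem_parts = np.array_split(g_mem, n_modules)
    return np.concatenate(
        [project_one(a, b) for a, b in zip(new_parts, mem_parts)]
    )


g_new = np.array([1.0, -2.0, 1.0, 1.0])  # current-task gradient
g_mem = np.array([0.0, 1.0, 1.0, 0.0])   # episodic-memory gradient

z_whole = p_mgem_direction(g_new, g_mem, n_modules=1)  # one group: plain projection
z_split = p_mgem_direction(g_new, g_mem, n_modules=2)  # two groups
print(np.dot(z_whole, g_mem), np.dot(z_split, g_mem))
```

Because the inner product over all parameters is the sum of the per-group inner products, satisfying the constraint in every group implies satisfying it for the whole model. The grouped search space is therefore a subset of the original half-space, which matches the caption's description of a more constrained search space.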

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 2
  • Lemma 1