Contrastive Continual Learning with Importance Sampling and Prototype-Instance Relation Distillation

Jiyong Li; Dilshod Azizov; Yang Li; Shangsong Liang

TL;DR

Experiments reveal that the proposed Contrastive Continual Learning via Importance Sampling (CCLIS) method notably outperforms existing baselines in terms of knowledge preservation and thereby effectively counteracts catastrophic forgetting in online contexts.

Abstract

Because contrastive learning methods yield high-quality representations, rehearsal-based contrastive continual learning has recently been proposed to explore how to continually learn transferable representation embeddings and thereby avoid the catastrophic forgetting seen in traditional continual learning settings. Building on this framework, we propose Contrastive Continual Learning via Importance Sampling (CCLIS), which preserves knowledge by recovering previous data distributions with a new Replay Buffer Selection (RBS) strategy. RBS minimizes the estimated variance so that the buffer retains hard negative samples for high-quality representation learning. Furthermore, we present the Prototype-instance Relation Distillation (PRD) loss, a technique designed to maintain the relationship between prototypes and sample representations through self-distillation. Experiments on standard continual learning benchmarks reveal that our method notably outperforms existing baselines in terms of knowledge preservation and thereby effectively counteracts catastrophic forgetting in online contexts. The code is available at https://github.com/lijy373/CCLIS.
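To illustrate the general idea of recovering a data distribution from a non-uniformly sampled buffer, the following minimal sketch estimates an expectation under a target distribution from draws taken under a different proposal distribution, reweighting each draw by the ratio of target to proposal probability. It is a generic self-normalized importance-sampling example, not the CCLIS implementation; the function name `importance_estimate` and all numeric values are hypothetical.

```python
import random


def importance_estimate(values, target_p, proposal_g, n_draws, seed=0):
    """Self-normalized importance-sampling estimate of E_p[values].

    Indices are drawn from the proposal g, and each draw is reweighted
    by p(i) / g(i) so that the weighted average targets p.
    """
    rng = random.Random(seed)
    draws = rng.choices(range(len(values)), weights=proposal_g, k=n_draws)
    weights = [target_p[i] / proposal_g[i] for i in draws]
    weighted_sum = sum(w * values[i] for w, i in zip(weights, draws))
    return weighted_sum / sum(weights)


# Hypothetical toy problem: four items, uniform target, skewed proposal.
values = [1.0, 2.0, 3.0, 4.0]
target_p = [0.25, 0.25, 0.25, 0.25]
proposal_g = [0.1, 0.2, 0.3, 0.4]

exact = sum(p * v for p, v in zip(target_p, values))
estimate = importance_estimate(values, target_p, proposal_g, n_draws=20000)
print(f"exact={exact:.3f} estimate={estimate:.3f}")
```

The variance of such an estimator depends on the choice of proposal, which is the quantity a variance-minimizing buffer selection strategy would try to control.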

Paper Structure

This paper contains 28 sections, 1 theorem, 38 equations, 3 figures, 9 tables, 1 algorithm.

Introduction
Related Work
Background
Problem Setup: Continual Learning
Preliminaries: Contrastive Learning
Preliminaries: Importance Sampling
Contrastive Continual Learning via Importance Sampling
Overview of Our Model
Prototype-based InfoNCE Loss via Importance Sampling
Replay Buffer Selection for Estimated Variance Minimization
PRD for Contrastive Continual Learning
Objective Function
Experimental Setup
Results and Discussions
RQ1: Performance on Sequentially Arriving Tasks
...and 13 more sections

Key Result

Theorem 1

Assuming that the gradients of the score functions are bounded, i.e., $\|\nabla_{\theta}s_{ij}\|_2 \leq M, \forall i,j$, we have the following bound on the mean square error between the estimator $\hat{\mu}_{ij}$ and the gradient $\mu_{ij}$ for a specific prototype $i$ and sample $j$: [...] where $g^{(m)}$ is the proposal distribution and $\omega_i^{(m)}$ is the importance weight for the specific class [...]

Figures (3)

Figure 1: Illustration of Contrastive Learning via Importance Sampling and PRD Loss. (a) When new tasks are introduced, buffer samples are drawn with specific sampling weights. Using importance sampling, we approximately recover the data distributions of previous tasks and apply prototype-based contrastive learning to previous and current data to obtain high-quality features. (b) Given the samples of a mini-batch, the PRD loss distills the relation between prototypes and instances from the previous model to the current one. We minimize the cross-entropy between the prototype-instance similarities of the current model and those of the previous model, whose parameters are frozen; the similarities are computed as dot products of normalized embeddings.
Figure 2: Performance variation with the distill power $\lambda$ on Seq-CIFAR-10 under the Class-IL scenario. PRD effectively enhances the performance of importance sampling-based contrastive learning by successfully maintaining the prototype-instance relationship.
Figure 3: Top: t-SNE visualization of feature embeddings from the replay buffer (colored) and all training samples (gray) of Seq-CIFAR-10. Bottom: Same as Top, but all samples are colored to distinguish the clusters clearly. Left: the buffer features drawn by Co2L, a contrastive continual learning algorithm with random sampling, are spread uniformly within the clusters. Right: the buffer features sampled by CCLIS are mainly distributed at the edges of the clusters. These can be viewed as hard negatives for other classes and help the model learn high-quality contrastive representations.
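The PRD description in Figure 1(b) can be sketched roughly as follows, using only the Python standard library. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the single shared `temperature`, the softmax over prototypes, and the toy embeddings are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relation(instance, prototypes, temperature):
    """Distribution over prototypes from dot products of normalized embeddings."""
    z = l2_normalize(instance)
    sims = [
        sum(a * b for a, b in zip(z, l2_normalize(proto))) / temperature
        for proto in prototypes
    ]
    return softmax(sims)


def prd_loss(cur_inst, cur_protos, old_inst, old_protos, temperature=0.5):
    """Mean cross-entropy between previous-model and current-model relations.

    The previous model's outputs are treated as fixed targets (frozen parameters).
    """
    total = 0.0
    for z_cur, z_old in zip(cur_inst, old_inst):
        p_old = relation(z_old, old_protos, temperature)
        p_cur = relation(z_cur, cur_protos, temperature)
        total += -sum(po * math.log(pc) for po, pc in zip(p_old, p_cur))
    return total / len(cur_inst)


# Hypothetical 2-D embeddings: two class prototypes and two instances.
protos = [[1.0, 0.0], [0.0, 1.0]]
old_instances = [[2.0, 1.0], [0.5, 3.0]]
drifted_instances = [[1.0, 2.0], [3.0, 0.5]]

same_loss = prd_loss(old_instances, protos, old_instances, protos)
drift_loss = prd_loss(drifted_instances, protos, old_instances, protos)
print(f"unchanged={same_loss:.4f} drifted={drift_loss:.4f}")
```

In this toy setup the loss is smallest when the current model reproduces the previous model's prototype-instance relations and grows when the instance embeddings drift toward other prototypes.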

Theorems & Definitions (2)

Theorem 1
Proof 1
