Provable Contrastive Continual Learning

Yichen Wen, Zhiquan Tan, Kaipeng Zheng, Chuanlong Xie, Weiran Huang

TL;DR

This work addresses the theoretical gap in contrastive continual learning by deriving bounds that link final test performance to the sequence of task-specific training losses, revealing how distillation and contrastive losses interact over time. It introduces CILA, an algorithm that adapts the distillation coefficient per task based on empirical loss ratios to tighten these guarantees in practice. Through theoretical results and experiments on Seq-CIFAR-10, Seq-Tiny-ImageNet, and R-MNIST, CILA achieves state-of-the-art performance and demonstrates the practical importance of pre-training and adaptive regularization. The findings provide a principled framework for representation-based continual learning, highlighting how balancing plasticity against memory stability yields robust continual learners.

Abstract

Continual learning requires learning incremental tasks with dynamic data distributions. So far, it has been observed that employing a combination of contrastive loss and distillation loss for training in continual learning yields strong performance. To the best of our knowledge, however, this contrastive continual learning framework lacks convincing theoretical explanations. In this work, we fill this gap by establishing theoretical performance guarantees, which reveal how the performance of the model is bounded by training losses of previous tasks in the contrastive continual learning framework. Our theoretical explanations further support the idea that pre-training can benefit continual learning. Inspired by our theoretical analysis of these guarantees, we propose a novel contrastive continual learning algorithm called CILA, which uses adaptive distillation coefficients for different tasks. These distillation coefficients are easily computed by the ratio between average distillation losses and average contrastive losses from previous tasks. Our method shows great improvement on standard benchmarks and achieves new state-of-the-art performance.
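The coefficient rule mentioned at the end of the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper name `adaptive_distillation_coefficient`, the direction of the ratio (distillation over contrastive), the `eps` guard, and the loss values are all assumptions made for the example, since the abstract only states that the coefficient is a ratio between average distillation losses and average contrastive losses from previous tasks.

```python
from statistics import mean


def adaptive_distillation_coefficient(prev_dis_losses, prev_con_losses, eps=1e-12):
    """Hypothetical helper: average distillation loss divided by average
    contrastive loss, both recorded on previous tasks.

    The direction of the ratio is an assumption read off the abstract.
    """
    if not prev_dis_losses or not prev_con_losses:
        raise ValueError("need at least one recorded loss of each kind")
    avg_dis = mean(prev_dis_losses)
    avg_con = mean(prev_con_losses)
    # eps guards against division by zero; it is not part of the source.
    return avg_dis / max(avg_con, eps)


# Made-up loss histories, for illustration only.
lam = adaptive_distillation_coefficient([0.8, 1.0, 1.2], [1.9, 2.0, 2.1])
print(f"lambda = {lam:.3f}")
```

With these illustrative numbers the averages are roughly 1.0 and 2.0, so the sketch yields a coefficient near 0.5.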

Paper Structure

This paper contains 18 sections, 8 theorems, 72 equations, 2 figures, 2 tables, and 1 algorithm.

Key Result

Lemma 1

When $t\ge 2$, for any data distribution $\mathcal{D}$, the contrastive losses of the current model $f_t$ and the previous model $f_{t-1}$ can be connected via the distillation loss through bounds whose constants are $\alpha = \frac{2e^2}{1+e^2}$, $\beta = 2-\alpha+\alpha \log \frac{\alpha}{2}$, and $\beta' = -\alpha \log(1+e^2) - \alpha$.
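To get a feel for the size of the constants in Lemma 1, they can be evaluated numerically. The snippet below only plugs the stated definitions of $\alpha$, $\beta$, and $\beta'$ into Python, taking $\log$ to be the natural logarithm (an assumption); the variable names are chosen for this example.

```python
import math

e2 = math.exp(2)  # e^2

# Constants exactly as defined in Lemma 1.
alpha = 2 * e2 / (1 + e2)
beta = 2 - alpha + alpha * math.log(alpha / 2)
beta_prime = -alpha * math.log(1 + e2) - alpha

print(f"alpha  = {alpha:.4f}")
print(f"beta   = {beta:.4f}")
print(f"beta'  = {beta_prime:.4f}")
```

Under that assumption, $\alpha \approx 1.76$, $\beta \approx 0.015$, and $\beta' \approx -5.51$. Note that $\alpha = 1 + \tanh(1)$, so it lies strictly between 1 and 2.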

Figures (2)

  • Figure 1: An illustration of the contrastive continual learning framework. At the end of the previous task, we store the previous model and the values of its losses. For the current task, augmentations are applied to both the buffered and the current data. The augmented data is then passed through the current model and the frozen previous model to obtain representations. The target loss of contrastive continual learning is a weighted sum of the contrastive loss and the distillation loss, with distillation coefficient $\lambda$.
  • Figure 2: An illustration of Example 2. The suggested $\lambda$ for $\rho=0.95$ or $\rho=1.05$ stays close to one.
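The target loss described in the Figure 1 caption can be written as a one-line function. This is a sketch under stated assumptions: the form `contrastive + lam * distillation` is the natural reading of "weighted sum with a distillation coefficient $\lambda$", the function name `contrastive_continual_objective` is hypothetical, and the scalar loss values are placeholders rather than results from the paper.

```python
def contrastive_continual_objective(contrastive_loss, distillation_loss, lam):
    """Sketch of the target loss: contrastive loss plus lam times the
    distillation loss. The exact weighting form is an assumption."""
    if lam < 0:
        raise ValueError("the distillation coefficient should be non-negative")
    return contrastive_loss + lam * distillation_loss


# Placeholder loss values: a larger lam puts more weight on staying close
# to the frozen previous model (memory stability), a smaller one on
# fitting the current task (plasticity).
for lam in (0.5, 1.0, 2.0):
    total = contrastive_continual_objective(2.0, 1.0, lam)
    print(f"lam = {lam}: total loss = {total}")
```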

Theorems & Definitions (16)

  • Lemma 1
  • Theorem 1
  • Example 1
  • Example 2
  • Example 3
  • Theorem 2
  • Lemma 1
  • proof
  • Theorem 2
  • proof
  • ...and 6 more