Integrating Present and Past in Unsupervised Continual Learning

Yipeng Zhang; Laurent Charlin; Richard Zemel; Mengye Ren

TL;DR

This work tackles forgetting and representation overlap in unsupervised continual learning (UCL) by proposing a unifying framework that separately optimizes three objectives, $\mathcal{L}_{\text{current}}$, $\mathcal{L}_{\text{cross}}$, and $\mathcal{L}_{\text{past}}$, which capture plasticity, cross-task consolidation, and stability, respectively. It introduces Osiris, a method that realizes these objectives with isolated embedding spaces and multiple projectors, achieving state-of-the-art performance on standard benchmarks such as Split-CIFAR-100 and on newly proposed structured benchmarks that mimic realistic learning environments. The study also analyzes the trade-offs between replay and distillation for stability, and reports preliminary evidence that models trained on the proposed structured task sequences can surpass offline iid training in some settings. Overall, Osiris provides a principled path to more robust continual learners by explicitly integrating present and past learning signals and by highlighting the importance of task structure and normalization choices for UCL.

Abstract

We formulate a unifying framework for unsupervised continual learning (UCL), which disentangles learning objectives that are specific to the present and the past data, encompassing stability, plasticity, and cross-task consolidation. The framework reveals that many existing UCL approaches overlook cross-task consolidation and try to balance plasticity and stability in a shared embedding space. This results in worse performance due to a lack of within-task data diversity and reduced effectiveness in learning the current task. Our method, Osiris, which explicitly optimizes all three objectives on separate embedding spaces, achieves state-of-the-art performance on all benchmarks, including two novel benchmarks proposed in this paper featuring semantically structured task sequences. Compared to standard benchmarks, these two structured benchmarks more closely resemble visual signals received by humans and animals when navigating real-world environments. Finally, we show some preliminary evidence that continual models can benefit from such realistic learning scenarios.
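To make the three objectives concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes an InfoNCE-style contrastive loss, linear stand-ins for the encoder $f$ and the two projectors $g$ and $h$, a replay-style stability term, and a cross-task term whose negatives come only from the other task. All names (`info_nce`, `osiris_style_losses`) and shapes are hypothetical.

```python
import numpy as np

def normalize(z):
    """Project each row onto the unit sphere."""
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def log_softmax_rows(logits):
    """Numerically stable row-wise log-softmax."""
    m = logits.max(axis=1, keepdims=True)
    return logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))

def info_nce(a, b, negatives=None, tau=0.5):
    """InfoNCE where a[i] and b[i] are positives.

    Negatives are the other rows of b by default, or an explicit set.
    """
    a, b = normalize(a), normalize(b)
    if negatives is None:
        logp = log_softmax_rows(a @ b.T / tau)  # diagonal holds positives
        return -np.mean(np.diag(logp))
    pos = (a * b).sum(axis=1, keepdims=True) / tau
    neg = a @ normalize(negatives).T / tau
    logp = log_softmax_rows(np.concatenate([pos, neg], axis=1))
    return -np.mean(logp[:, 0])  # column 0 holds the positive

def osiris_style_losses(x1, x2, y1, y2, f, g, h):
    """x1, x2: two views of the current batch; y1, y2: two views of memory.

    Assumption: L_current lives in the g∘f space, while L_cross and
    L_past (replay variant) live in the separate h∘f space.
    """
    fx1, fx2, fy1, fy2 = x1 @ f, x2 @ f, y1 @ f, y2 @ f
    l_current = info_nce(fx1 @ g, fx2 @ g)   # plasticity
    l_past = info_nce(fy1 @ h, fy2 @ h)      # stability via replay
    l_cross = 0.5 * (                        # cross-task consolidation
        info_nce(fx1 @ h, fx2 @ h, negatives=fy1 @ h)
        + info_nce(fy1 @ h, fy2 @ h, negatives=fx1 @ h)
    )
    return l_current, l_cross, l_past
```

Keeping `g` and `h` as separate matrices is the point of the sketch: the current-task loss never shares a projector with the terms that involve past data.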

Paper Structure (51 sections, 1 theorem, 13 equations, 11 figures, 8 tables)

Introduction
Preliminaries
Self-Supervised Learning
Generalized contrastive loss.
Unsupervised Continual Learning
Dissecting the Learning Objective of UCL
Three Desirable Properties
Plasticity and stability.
Cross-task consolidation.
Osiris: Integrating Objectives of Present and Past
Plasticity Loss
Stability Loss
Osiris-D(istillation).
Osiris-R(eplay).
Remark.
...and 36 more sections

Key Result

Proposition 1

Let $\nu \sim \text{Beta}(\alpha, \alpha)$ and let $\mathcal{L}^{\text{LUMP}}(X, Y; \nu, f_\Theta, g_\Phi) \coloneqq \mathcal{L}^{\text{SSL}}(\Tilde{X}; f_\Theta, g_\Phi)$ be as described above. Define ${\bm{z}}_i \coloneqq g(f({\bm{x}}_i)) / \|g(f({\bm{x}}_i))\|_2$, ${\bm{u}}_i \coloneqq g(f({\bm{y}}_i)) \ldots$ [...] where $a_{(\cdot)}, b_{(\cdot)}, c_{(\cdot)}, d_{(\cdot)} \geq 0$ are scalar functions of $\alpha$.
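
The mixup step behind $\mathcal{L}^{\text{LUMP}}$ can be sketched as follows. This is an illustration under assumptions: it takes $\tilde{X} = \nu X + (1 - \nu) Y$ with a single $\nu$ per batch (the excerpt only says "as described above"), and it uses a linear map `W` as a stand-in for $g \circ f$. The names `lump_mix` and `unit_embed` are hypothetical.

```python
import numpy as np

def lump_mix(x, y, alpha, rng):
    """Mix current batch x with memory batch y using nu ~ Beta(alpha, alpha)."""
    nu = rng.beta(alpha, alpha)
    return nu * x + (1.0 - nu) * y, nu

def unit_embed(x, W):
    """L2-normalised embedding, mirroring z_i = g(f(x_i)) / ||g(f(x_i))||_2."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # current-task examples (toy data)
y = rng.normal(size=(4, 8))   # memory examples (toy data)
W = rng.normal(size=(8, 3))   # assumed linear stand-in for g∘f

x_tilde, nu = lump_mix(x, y, alpha=0.4, rng=rng)
z_tilde = unit_embed(x_tilde, W)  # embeddings fed to the SSL loss
```

The SSL loss is then applied to the mixed batch only, so present and past examples share a single embedding space in this scheme.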

Figures (11)

Figure 1: Left: illustration of our method. Dashed arrows denote optional computations because the stability loss $\mathcal{L}_{\text{past}}$ can be achieved through distillation or replay. Right: conceptual loss space. A separate projector helps with optimization.
Figure 2: KNN accuracy of methods trained on the 20-task Split-CIFAR-100 with BatchNorm (BN) or GroupNorm (GN). $\mathcal{D}_1$-Only denotes an offline model trained for the same number of steps but only on the first task. CaSSL (Fini et al., 2022) and LUMP (Madaan et al., 2022) are state-of-the-art UCL methods. The incompatibility between BN and UCL can be mitigated by using GN instead.
Figure 3: (a) Interplay between plasticity (current-task accuracy), cross-task consolidation (task-level KNN accuracy), and stability (accuracy of the first task throughout training). Osiris-D balances the three aspects well and is usually the top performer. (b) Relative difference between the contrastive loss on past-task data and on memory for replay-based methods. All methods except Osiris-D show signs of overfitting.
Figure 4: Mean cosine similarity between pairs of examples drawn from pairs of classes. Environment switches are marked with dashed white lines. Classes within the third environment are projected to nearby positions in the representation space by Offline, but not by FT and Osiris-D.
Figure 5: Relative difference between contrastive loss on past-task data and on memory for replay-based methods. (a) All curves are calculated with the projector outputs where $\mathcal{L}_{\text{current}}$ is applied, i.e., with $g \circ f$. (b) Same as Fig. 3(b): for Osiris-D and Osiris-R, the curves are calculated with the outputs of $h \circ f$, where $\mathcal{L}_{\text{past}}$ is applied; the curves for ER and LUMP are still calculated on their only projector branch, i.e., $g \circ f$. Osiris-D does not overfit on either branch.
...and 6 more figures
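
The quantity shown in Figure 4, the mean cosine similarity between examples from pairs of classes, can be computed along these lines. This is a sketch assuming precomputed feature vectors and integer class labels. The function name `class_pair_cosine` is hypothetical, and this version includes each example's similarity with itself on the diagonal blocks, which may differ from the paper's protocol.

```python
import numpy as np

def class_pair_cosine(features, labels):
    """Return a (C, C) matrix of mean pairwise cosine similarities.

    Entry (a, b) averages cos(z_i, z_j) over all i in class a, j in class b.
    """
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    classes = np.unique(labels)
    # Row-normalised class membership, so M @ S @ M.T averages over pairs.
    M = (labels[None, :] == classes[:, None]).astype(float)
    M /= M.sum(axis=1, keepdims=True)
    return M @ (z @ z.T) @ M.T, classes

# Toy usage: two classes lying along orthogonal directions.
feats = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [0.0, 1.0]])
labs = np.array([0, 0, 1, 1])
sim, classes = class_pair_cosine(feats, labs)
```

Ordering the rows and columns of `sim` by environment would give the block structure that the caption describes.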

Theorems & Definitions (2)

Proposition 1
proof
