CroMo-Mixup: Augmenting Cross-Model Representations for Continual Self-Supervised Learning

Erum Mushtaq; Duygu Nur Yaldiz; Yavuz Faruk Bakman; Jie Ding; Chenyang Tao; Dimitrios Dimitriadis; Salman Avestimehr

TL;DR

This work addresses class-incremental continual self-supervised learning (CSSL) by identifying task confusion as a critical yet underexplored challenge. It introduces CroMo-Mixup, a two-part framework with Cross-Task Data Mixup and Cross-Model Feature Mixup that diversifies negatives and learns cross-task similarities using embeddings from both current and old models, while optionally applying distillation. The approach is shown to be compatible with four SSL objectives and yields consistent improvements in average linear accuracy and Task-ID prediction across CIFAR10, CIFAR100, and TinyImageNet splits, often surpassing CaSSLe and CaSSLe+ under limited memory budgets. These results suggest CroMo-Mixup effectively enhances cross-task class separation and old-knowledge retrieval, with practical implications for scalable CSSL in unlabeled, sequential data settings. Limitations include dependency on a memory buffer and explicit task boundaries; future work could explore privacy-preserving replay and smoother task transitions.

Abstract

Continual self-supervised learning (CSSL) learns a series of tasks sequentially on unlabeled data. Two main challenges of continual learning are catastrophic forgetting and task confusion. While the CSSL problem has been studied to address catastrophic forgetting, little work has been done to address task confusion. In this work, we show through extensive experiments that self-supervised learning (SSL) can make CSSL more susceptible to the task confusion problem, particularly in the less diverse settings of class-incremental learning, because classes belonging to different tasks are never trained concurrently. Motivated by this challenge, we present a novel cross-model feature Mixup (CroMo-Mixup) framework that addresses this issue through two key components: 1) Cross-Task data Mixup, which mixes samples across tasks to enhance negative sample diversity; and 2) Cross-Model feature Mixup, which learns similarities between the embeddings of the mixed sample and of the original images, obtained from the current and old models, facilitating cross-task class contrast learning and old knowledge retrieval. We evaluate the effectiveness of CroMo-Mixup in improving both Task-ID prediction and average linear accuracy across all tasks on three datasets, CIFAR10, CIFAR100, and TinyImageNet, under different class-incremental learning settings. We validate the compatibility of CroMo-Mixup with four state-of-the-art SSL objectives. Code is available at https://github.com/ErumMushtaq/CroMo-Mixup.
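The Cross-Task data Mixup step described in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the authors' repository: the function name `cross_task_mixup`, the `alpha` parameter, and the Beta(alpha, alpha) draw for the mixing coefficient are assumptions borrowed from standard Mixup. The source only states that current-task and memory-buffer samples are combined by convex interpolation.

```python
import numpy as np

def cross_task_mixup(x_current, x_memory, alpha=1.0, rng=None):
    """Convexly interpolate current-task samples with memory-buffer samples.

    Hypothetical helper: alpha and the Beta(alpha, alpha) draw are assumptions
    borrowed from standard Mixup, not details confirmed by the source.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Pair every current-task sample with a randomly chosen memory sample.
    idx = rng.integers(0, len(x_memory), size=len(x_current))
    # One mixing coefficient per sample, broadcast over the remaining axes.
    lam = rng.beta(alpha, alpha, size=len(x_current))
    lam_b = lam.reshape(-1, *([1] * (x_current.ndim - 1)))
    x_mixed = lam_b * x_current + (1.0 - lam_b) * x_memory[idx]
    return x_mixed, idx, lam
```

In the framework as described, the mixed samples and the current-task samples would then be embedded by the current model, and the selected memory samples (`x_memory[idx]`) by the frozen old model, before the similarity between those embeddings is learned. The returned `idx` and `lam` keep track of which originals went into each mixed sample and in what proportion.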

Paper Structure (38 sections, 10 equations, 8 figures, 7 tables)

Introduction
Preliminaries
Self-Supervised Learning
Problem Definition and Evaluation Setup
Continual Self-Supervised Learning
Evaluation of Class Incremental Self-Supervised Learning
Challenges of Class Incremental Self-Supervised Learning
Catastrophic Forgetting
Task Confusion
Proposed Method
Cross-Task Data Mixup
Cross-Model Feature Mixup (CroMo-Mixup)
Related Works
Experiments
Experiment Settings
...and 23 more sections

Figures (8)

Figure 1: Illustration of our proposed CroMo-Mixup framework. At the input, cross-task mixed samples are generated by a convex interpolation of current-task samples and old-task samples from the memory buffer. At the output, the model learns similarities between the embeddings of the cross-task mixed sample and the original samples that were mixed to create it. The embeddings of memory buffer samples come from the frozen network saved from the old task (t-1), whereas the embeddings of mixed samples and current-task samples are obtained from the network of the current task (t). In addition, the model learns the current task via a task-specific SSL loss and distills old knowledge on the current-task samples via a temporal projector-based distillation loss.
Figure 2: Demonstration of the catastrophic forgetting and task confusion challenges in a two-task continual learning setup where each task contains three classes. Figure (a) illustrates the linear separability of the latent vectors of task 1 classes at the end of task 1 training. Figures (b)-(e) represent four possible cases after training on task 2. Figure (b) shows the desired case, where all classes of both tasks are linearly separable. Figure (c) illustrates the forgetting effect, where task 2 classes are linearly separable but task 1 classes are not. Figure (d) shows the task confusion problem, where the model fails to draw distinct decision boundaries between classes of different tasks and may produce overlapping clusters. Figure (e) shows the combined effect of task confusion and forgetting, which is the problem in CSSL settings that we want to solve.
Figure 3: Depiction of the 100x1 and 10x10 CIL-minibatch task confusion experiment setup on the CIFAR100 dataset. Figure (a) represents the 100x1 case, where regular uniform sampling is performed over all samples, covering all 100 classes. Figure (b) shows the 10x10 setting, where there are 10 tasks and each task contains only 10 classes. Classes are mutually exclusive across tasks. For SSL training, each mini-batch is sampled from a single task. After each iteration, the mini-batch sampler moves to the next task, so that tasks are revisited throughout training.
Figure 4: Training LA, WP, and TP performance of contrastive SSL methods (CorInfomax, Barlow-Twins, SimCLR, and BYOL) and supervised learning on the CIFAR100 dataset for the 100x1 and 10x10 CIL-minibatch settings. Figure (a) demonstrates that the 10x10 setting leads to a significant accuracy drop across all SSL baselines compared to the 100x1 setting. Figure (b) shows that the lower linear accuracy is reflected in lower Task-ID prediction performance, demonstrating the task confusion problem. Figure (c) shows that the WP performance remains relatively good.
Figure 5: Training LA, WP, and TP performance of contrastive SSL methods and supervised learning on the CIFAR100 dataset for both the 100x1 and 10x10 DIL-minibatch settings. Figures (a), (b), and (c) demonstrate that the 10x10 setting performs as well as the 100x1 setting in terms of LA, TP, and WP, respectively.
...and 3 more figures
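
The CIL-minibatch sampling scheme described in the Figure 3 caption can be sketched as follows. This is an illustrative reading of the caption rather than the authors' code: the generator name `cil_minibatches`, its parameters, and the assignment of classes to tasks in contiguous, equally sized groups are assumptions.

```python
import numpy as np

def cil_minibatches(labels, num_tasks, batch_size, num_iters, rng=None):
    """Yield (task_id, sample_indices) with every mini-batch drawn from one task.

    Hypothetical helper. Labels are used only to partition the data into
    tasks; the SSL training itself would not see them.
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    # Assumption: classes are split into contiguous, mutually exclusive groups.
    task_classes = np.array_split(np.unique(labels), num_tasks)
    task_indices = [np.flatnonzero(np.isin(labels, c)) for c in task_classes]
    for it in range(num_iters):
        task = it % num_tasks  # move to the next task after every iteration
        pool = task_indices[task]
        # Sample with replacement only if the task has too few samples.
        batch = rng.choice(pool, size=batch_size, replace=len(pool) < batch_size)
        yield task, batch
```

With `num_tasks=1` the generator reduces to uniform sampling over all classes, which corresponds to the 100x1 case; with `num_tasks=10` on 100 classes it gives the 10x10 case, where no mini-batch ever contains classes from two different tasks.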
