DELTA: Decoupling Long-Tailed Online Continual Learning

Siddeshwar Raghavan, Jiangpeng He, Fengqing Zhu

TL;DR

DELTA tackles long-tailed online continual learning by decoupling representation learning from classifier learning in a two-stage pipeline. Stage 1 uses supervised contrastive loss $L_{contrastive}$ to learn robust representations from the streaming data and memory, while Stage 2 freezes the backbone and trains with Equalization Loss $L_{EQ}$ using a task-specific distribution vector $P(k^t)$ to reweight logits $O^t(I_x)$. A multi-exemplar learning strategy pairs multiple exemplars from memory with each input to balance batches and reduce gradient variance. On CIFAR-100-LT and VFN-LT, DELTA consistently surpasses existing OCL methods across memory sizes and task configurations; ablations confirm the contributions of dual-stage decoupling, $L_{EQ}$, and multi-exemplar pairing. These results suggest strong potential for real-world online learning under severe long-tailed distributions.
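
As a rough illustration of the Stage 2 idea, the Python sketch below computes a softmax cross-entropy in which rare non-target classes are dropped from the normaliser, in the spirit of a softmax equalization loss. It is a sketch under stated assumptions, not the paper's $L_{EQ}$: `class_freq` stands in for the distribution vector $P(k^t)$, and the `threshold` parameter and all function names are hypothetical.

```python
import math

def equalization_weights(class_freq, target, threshold):
    """Return 0 for rare non-target classes and 1 for every other class.

    class_freq is an assumed stand-in for the distribution vector P(k^t).
    """
    return [
        0.0 if (k != target and freq < threshold) else 1.0
        for k, freq in enumerate(class_freq)
    ]

def equalized_cross_entropy(logits, target, class_freq, threshold=0.05):
    """Cross-entropy whose softmax denominator skips rare non-target classes.

    Skipping them removes the discouraging gradient that frequent-class
    samples would otherwise push onto tail-class logits.
    """
    weights = equalization_weights(class_freq, target, threshold)
    shift = max(logits)  # subtract the max logit for numerical stability
    denom = sum(w * math.exp(z - shift) for w, z in zip(weights, logits))
    return -(logits[target] - shift) + math.log(denom)

# Three classes; class 2 is a tail class under the assumed threshold.
freq = [0.70, 0.28, 0.02]
logits = [2.0, 1.0, 3.0]
loss_equalized = equalized_cross_entropy(logits, target=0, class_freq=freq)
loss_plain = equalized_cross_entropy(logits, target=0, class_freq=freq, threshold=0.0)
```

With `threshold=0.0` no class counts as rare, so the function reduces to ordinary cross-entropy, which makes the effect of the reweighting easy to compare. The sketch is deterministic; any stochastic masking or task-dependent scaling used in the paper is not reproduced here.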

Abstract

A significant challenge in achieving ubiquitous Artificial Intelligence is the limited ability of models to rapidly learn new information in real-world scenarios where data follows long-tailed distributions, all while avoiding forgetting previously acquired knowledge. In this work, we study the under-explored problem of Long-Tailed Online Continual Learning (LTOCL), which aims to learn new tasks from sequentially arriving class-imbalanced data streams. Each data sample is observed only once during training, and the task data distribution is not known in advance. We present DELTA, a decoupled learning approach designed to enhance representation learning and address the substantial imbalance in LTOCL. We enhance the learning process by adapting supervised contrastive learning to attract similar samples and repel dissimilar (out-of-class) samples. Further, by balancing gradients during training using an equalization loss, DELTA significantly enhances learning outcomes and successfully mitigates catastrophic forgetting. Through extensive evaluation, we demonstrate that DELTA improves the capacity for incremental learning, surpassing existing OCL methods. Our results suggest considerable promise for applying OCL in real-world applications.
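
To make the "attract similar, repel dissimilar" idea concrete, here is a minimal pure-Python sketch of a supervised contrastive loss in its commonly used form (positives are same-label samples, the denominator runs over all other samples in the batch). The temperature value, function names, and the toy embeddings are assumptions for illustration; the paper's exact $L_{contrastive}$ may differ.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Average supervised contrastive loss over anchors that have a positive."""
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, num_anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class partner contributes nothing
        # Temperature-scaled cosine similarity between anchor i and every sample.
        sims = [sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n)]
        others = [sims[a] for a in range(n) if a != i]
        shift = max(others)  # log-sum-exp trick for numerical stability
        log_denom = shift + math.log(sum(math.exp(s - shift) for s in others))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        num_anchors += 1
    return total / num_anchors if num_anchors else 0.0

# Two tight clusters in 2-D.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = supervised_contrastive_loss(points, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(points, [0, 1, 0, 1])
```

When labels agree with the geometric clusters the loss is close to zero; when labels cut across the clusters it is much larger, which is the signal that pulls same-class embeddings together during representation learning.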

Paper Structure (17 sections, 8 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Illustration of the online and offline setups for continual learning with a long-tailed distribution. In continual learning, tasks arrive sequentially, one at a time. In the "online" scenario, the model accesses only the current task and its distribution, while the "offline" scenario grants access to the complete task set and its distributions. Additionally, the "online" approach trains on task data in a single pass, while the "offline" approach makes multiple passes over the entire dataset.
  • Figure 2: An overview of the DELTA framework. At task $t$, the current batch of samples ($X_t$) and the samples retrieved from the memory buffer ($B_t$) are augmented ($\hat{X}_t$, $\hat{B}_t$) and then combined ($G_t$). The combined data passes sequentially through a dual-stage training pipeline. In the first stage, the framework uses contrastive learning with a contrastive loss ($L_{contrastive}$) to learn effective data representations. In the second stage, learning is decoupled by freezing all layers except the classification layer ($O^t$). This targeted training employs the weight equalization loss ($L_{EQ}$) to train a balanced classifier and reduce the shift in future data representations.
  • Figure 3: Confusion matrices for DELTA, OnPro, and CBRS on CIFAR100-LT with a memory buffer of 2,000 show distinct patterns. Single-stage methods (OnPro, CBRS) are biased towards recent tasks, particularly with long-tailed samples, and often misclassify many samples as belonging to the latest task classes. DELTA exhibits a reduced bias thanks to its decoupled learning architecture, which incorporates a contrastive learner and employs an equalization loss.
  • Figure 4: Performance of DELTA at $\rho = 0.01$ (top) and at $\rho = 1$ (conventional) with an increasing number of paired exemplars. Results are shown for CIFAR100-LT with a 2K buffer across 20 tasks.
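
The imbalance ratio $\rho$ and the notion of paired exemplars in the captions above can be sketched in a few lines of Python. Two assumptions are made here and are not taken from the paper: class sizes follow the exponential profile commonly used to build long-tailed benchmarks (the rarest class keeps a fraction $\rho$ of the largest class's samples), and pairing is done by uniform sampling from the memory buffer. The function names are hypothetical.

```python
import random

def long_tailed_counts(num_classes, max_per_class, rho):
    """Per-class sample counts decaying exponentially from max to rho * max.

    rho = 1 gives the conventional balanced setting.
    """
    if num_classes == 1:
        return [max_per_class]
    return [
        max(1, round(max_per_class * rho ** (c / (num_classes - 1))))
        for c in range(num_classes)
    ]

def pair_with_exemplars(stream_batch, memory, k, rng):
    """Pair each incoming sample with up to k distinct exemplars from memory."""
    paired = []
    for sample in stream_batch:
        exemplars = rng.sample(memory, min(k, len(memory)))
        paired.append((sample, exemplars))
    return paired

# 100 classes, 500 samples for the largest class.
counts_tailed = long_tailed_counts(100, 500, rho=0.01)
counts_balanced = long_tailed_counts(100, 500, rho=1.0)

# Pair two streamed samples with three exemplars each from a toy buffer.
rng = random.Random(0)
memory_buffer = list(range(10))
batch = pair_with_exemplars(["x1", "x2"], memory_buffer, k=3, rng=rng)
```

Increasing `k` raises the share of memory samples in each training batch, which is the knob varied along the horizontal axis of Figure 4; how exemplars are actually selected in DELTA is not specified in this excerpt.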