Dual-CBA: Improving Online Continual Learning via Dual Continual Bias Adaptors from a Bi-level Optimization Perspective

Quanziang Wang, Renzhen Wang, Yichen Wu, Xixi Jia, Minghao Zhou, Deyu Meng

TL;DR

This work tackles online continual learning (CL) where non-stationary data shifts cause catastrophic forgetting and task-recency bias. It introduces a bi-level optimization framework with Dual-CBA, combining a class-specific CBA and a class-agnostic CBA to adapt the posterior $P(Y|X)$ online, paired with Incremental Batch Normalization to stabilize feature statistics. Theoretical results show gradient alignment between training and memory data, and a closed-form solution in the linear case provides intuition for why the method mitigates forgetting. Empirically, Dual-CBA consistently improves performance across four rehearsal-based baselines on three benchmarks, including semi-supervised and offline settings, and demonstrates strong transferability of the class-agnostic CBA. The approach yields real-time evaluation capability without test-time overhead, offering a practical boost for online CL in non-stationary environments.

Abstract

In online continual learning (CL), models trained on changing distributions easily forget previously learned knowledge and bias toward newly received tasks. To address this issue, we present Continual Bias Adaptor (CBA), a bi-level framework that augments the classification network to adapt to catastrophic distribution shifts during training, enabling the network to achieve a stable consolidation of all seen tasks. However, the CBA module adjusts distribution shifts in a class-specific manner, exacerbating the stability gap issue and, to some extent, fails to meet the need for continual testing in online CL. To mitigate this challenge, we further propose a novel class-agnostic CBA module that separately aggregates the posterior probabilities of classes from new and old tasks, and applies a stable adjustment to the resulting posterior probabilities. We combine the two kinds of CBA modules into a unified Dual-CBA module, which thus is capable of adapting to catastrophic distribution shifts and simultaneously meets the real-time testing requirements of online CL. Besides, we propose Incremental Batch Normalization (IBN), a tailored BN module to re-estimate its population statistics for alleviating the feature bias arising from the inner loop optimization problem of our bi-level framework. To validate the effectiveness of the proposed method, we theoretically provide some insights into how it mitigates catastrophic distribution shifts, and empirically demonstrate its superiority through extensive experiments based on four rehearsal-based baselines and three public continual learning benchmarks.
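
The class-agnostic idea described above (aggregating the posterior over old-task and new-task classes, then adjusting the two aggregated probabilities) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names and the scalar factors `scale_old` and `scale_new` are hypothetical stand-ins for whatever adjustment the learned class-agnostic CBA module produces, and the sketch assumes the old classes occupy the first `num_old` positions of the output vector.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_old_new(probs, num_old):
    """Collapse a per-class posterior into (old-task mass, new-task mass)."""
    return sum(probs[:num_old]), sum(probs[num_old:])


def class_agnostic_adjust(probs, num_old, scale_old, scale_new):
    """Rescale the old and new groups as a whole, then renormalize.

    scale_old / scale_new are hypothetical fixed factors; in the paper the
    adjustment is learned in the outer loop. Every class in a group gets the
    same factor, so relative probabilities inside each group are preserved.
    """
    adjusted = [
        p * (scale_old if i < num_old else scale_new)
        for i, p in enumerate(probs)
    ]
    total = sum(adjusted)
    return [a / total for a in adjusted]


# Example: 3 old classes followed by 2 new classes.
logits = [2.0, 1.0, 0.5, 3.0, 2.5]
probs = softmax(logits)
adjusted = class_agnostic_adjust(probs, num_old=3, scale_old=1.5, scale_new=1.0)
print("before:", aggregate_old_new(probs, num_old=3))
print("after: ", aggregate_old_new(adjusted, num_old=3))
```

Because the adjustment acts on only two aggregated quantities rather than on every class, it has far fewer degrees of freedom than a class-specific adjustment, which is one plausible reading of why the abstract calls it "stable".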

Paper Structure (30 sections, 2 theorems, 31 equations, 9 figures, 12 tables, 2 algorithms)

Key Result

Theorem 1

Let $\mathcal{G}^{buf} \triangleq \frac{\partial \mathcal{L}^{buf}\left(\mathcal{B}^{buf}; f_{\theta^k(\phi)}\right)}{\partial \theta^k}$ and $\mathcal{G}^{trn} \triangleq \frac{\partial \mathcal{L}^{trn}\left(\mathcal{B}^{trn}; \mathcal{F}_{\theta^k, \phi}\right)}{\partial \theta^k}$ denote the gradients of $\mathcal{L}^{buf}$ and $\mathcal{L}^{trn}$ with respect to $\theta^k$, respectively [...] where $\alpha > 0$ is the inner-loop learning rate and $\eta > 0$ is the Lipschitz constant.

Figures (9)

  • Figure 1: Tracking the accuracy of the 1st task with different incoming classes in the 4th and 5th tasks, plotted as red and blue lines, respectively. The label distribution $\mathbb P(Y)$ remains unchanged between the two lines, while the final accuracy of the 1st task varies dramatically. This indicates the impact of feature distribution shifts (i.e., changes of $\mathbb P(X|Y)$) in CL.
  • Figure 2: The average accuracy during the whole continual training process on Split CIFAR-100 with memory buffer size M=2k. The stability gap refers to the drop in old-task performance when the model starts learning a new task, followed by a quick recovery. Compared with the baseline ER, ER-CBA (the method from our conference version, CBA) aggravates the stability gap, while our ER-Dual-CBA effectively alleviates it.
  • Figure 3: Method overview. At each iteration step $k$, the classification model parameter $\theta$ and the Dual-CBA parameter $\phi = \{ \omega, \nu \}$ are jointly updated through the bi-level optimization framework, where $\omega$ and $\nu$ represent parameters of the class-specific CBA and class-agnostic CBA, respectively. For the inner loop, the forward process computes the rehearsal training loss and the backward process updates the classification model parameter $\theta(\phi)$ by the inner-loop update equation. For the outer loop, the forward process computes the outer objective loss function and the backward process updates the Dual-CBA parameter $\phi$ by the outer-loop update equation.
  • Figure 4: Illustration of the posterior distribution $\hat{y}$ predicted by the classification network before and after the task transition timestamp. We train the classification network with class-specific CBA on Split CIFAR-100 and take test samples of the 85th class (which belongs to the 9th task) as an example. The posterior distribution $\hat{y}$ contains the probabilities for all 100 classes; for clarity, we show only one class from each task. Before the task transition (end of the 9th task), the classification network assigns a high posterior probability to the 85th class. However, after the task transition (start of the 10th task), the posterior probability of the 85th class drops dramatically, and the prediction becomes seriously biased toward new tasks.
  • Figure 5: Illustration of the old-task posterior probability $\hat{y}_{old}$ and the new-task posterior probability $\hat{y}_{new}$ predicted by the classification network at each task transition timestamp during the whole continual learning process: (a) test samples from all old tasks; (b) test samples from the new task. The classification network is trained by ER on Split CIFAR-100, where the relationship between the old- and new-task probabilities shows a stable trend as training progresses.
  • ...and 4 more figures
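
The inner/outer update pattern described in the Figure 3 caption can be sketched on a toy scalar problem. Everything here is an illustrative assumption rather than the paper's actual losses or update schedule: the "model" is a single scalar `theta`, the "adaptor" is a single scalar bias `phi`, the inner (training) loss is `(theta + phi - y_trn)**2` with the adaptor attached, and the outer (buffer) loss is `(theta_new - y_buf)**2` evaluated on the model alone. The learning rates `alpha` and `beta` and the order of the two updates are also hypothetical choices.

```python
def inner_step(theta, phi, y_trn, alpha):
    """One inner-loop gradient step on L_trn = (theta + phi - y_trn)^2 w.r.t. theta."""
    grad_theta = 2.0 * (theta + phi - y_trn)
    return theta - alpha * grad_theta


def outer_step(theta, phi, y_trn, y_buf, alpha, beta):
    """One outer-loop step on L_buf = (theta_new - y_buf)^2 w.r.t. phi.

    theta_new depends on phi through the inner step, so the chain rule gives
    dL_buf/dphi = 2 * (theta_new - y_buf) * d(theta_new)/d(phi).
    """
    theta_new = inner_step(theta, phi, y_trn, alpha)
    dtheta_new_dphi = -2.0 * alpha  # derivative of the inner step w.r.t. phi
    grad_phi = 2.0 * (theta_new - y_buf) * dtheta_new_dphi
    return phi - beta * grad_phi


def train(theta, phi, y_trn, y_buf, alpha=0.1, beta=0.5, steps=200):
    """Alternate outer (adaptor) and inner (model) updates."""
    for _ in range(steps):
        phi = outer_step(theta, phi, y_trn, y_buf, alpha, beta)
        theta = inner_step(theta, phi, y_trn, alpha)
    return theta, phi


# The training target (3.0) and the buffer target (1.0) disagree on purpose.
theta, phi = train(theta=0.0, phi=0.0, y_trn=3.0, y_buf=1.0)
print("theta:", theta, "phi:", phi)
```

In this toy setting the adaptor absorbs the gap between the training target and the buffer target, so the bare model `theta` ends up fitting the buffer. That mirrors, in a very simplified form, the idea that the adaptor handles the distribution shift during training and is not needed at test time.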

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2