Retrospective Feature Estimation for Continual Learning

Nghia D. Nguyen, Hieu Trung Nguyen, Ang Li, Hoang Pham, Viet Anh Nguyen, Khoa D. Doan

TL;DR

This paper tackles catastrophic forgetting in continual learning by introducing Retrospective Feature Estimation (RFE), a mechanism that uses a chain of lightweight retrospector modules to map current features $f_t(\boldsymbol{x})$ back toward past representations $f_{t-1}(\boldsymbol{x})$, enabling backward rectification of learned knowledge. The retrospector training relies on a latent-estimation loss $\mathcal{L}_{FE}$ and supports three data strategies (RFE, RFE-P, RFE-B) to balance privacy and performance, while keeping changes to the main task learning minimal. Empirically, RFE and its variants achieve competitive or superior performance compared to strong rehearsal-based baselines on standard CL benchmarks (S-CIFAR10, S-CIFAR100, S-TinyImg), with particular gains on long task sequences and in TIL/CIL scenarios. The approach is data-efficient, potentially data-free, and integrates into existing CL pipelines by adding lightweight retrospector modules that operate post-training, offering a principled alternative to traditional replay or architectural expansion methods.
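The latent-estimation loss $\mathcal{L}_{FE}$ trains each retrospector to map current features back to the previous task's features. A minimal sketch of that idea, assuming a simple mean-squared-error form and a linear retrospector (the paper's exact loss and module may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 8, 32  # feature dimension, batch size

f_prev = rng.normal(size=(n, D))                 # frozen targets f_{t-1}(x)
f_curr = f_prev + 0.1 * rng.normal(size=(n, D))  # drifted features f_t(x)
R = np.eye(D)                                    # retrospector r_t (linear stand-in)

def loss_fe(R):
    """Mean squared latent-estimation error ||r_t(f_t(x)) - f_{t-1}(x)||^2."""
    return float(np.mean(np.sum((f_curr @ R.T - f_prev) ** 2, axis=1)))

# One gradient step on the MSE (closed-form gradient for the linear map).
grad = (2.0 / n) * (f_curr @ R.T - f_prev).T @ f_curr
R_new = R - 0.1 * grad
```

Because the main task learning is left untouched, this objective only shapes the retrospector: $f_{t-1}$ and $f_t$ stay frozen while $r_t$ is fit to bridge them.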

Abstract

The intrinsic capability to continuously learn a changing data stream is a desideratum of deep neural networks (DNNs). However, current DNNs suffer from catastrophic forgetting, which interferes with remembering past knowledge. To mitigate this issue, existing Continual Learning (CL) approaches often retain exemplars for replay, regularize learning, or allocate dedicated capacity for new tasks. This paper investigates an unexplored direction for CL called Retrospective Feature Estimation (RFE). RFE learns to reverse feature changes by aligning the features from the current trained DNN backward to the feature space of the old task, where performing predictions is easier. This retrospective process utilizes a chain of small feature mapping networks called retrospector modules. Empirical experiments on several CL benchmarks, including CIFAR10, CIFAR100, and Tiny ImageNet, demonstrate the effectiveness and potential of this novel CL direction compared to existing representative CL methods, motivating further research into retrospective mechanisms as a principled alternative for mitigating catastrophic forgetting in CL. Code is available at: https://github.com/mail-research/retrospective-feature-estimation.
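At inference, the retrospective process composes the modules: the current feature is passed through $r_N, \ldots, r_{t+1}$ to recover an approximation of the task-$t$ representation, which the task-$t$ head then classifies. A hedged sketch with hypothetical linear retrospectors and classifier heads (not the released implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, N = 8, 4, 3  # feature dim, classes per task head, number of tasks

# Hypothetical lightweight retrospectors: r_i maps task-i features back
# toward the task-(i-1) feature space (plain linear maps for illustration).
retrospectors = {i: rng.normal(size=(D, D)) / np.sqrt(D) for i in range(2, N + 1)}
heads = {t: rng.normal(size=(C, D)) for t in range(1, N + 1)}  # classifier heads w_t

def retrospect(feat, current_task, target_task):
    """Chain r_current, ..., r_{target+1} to rectify the latent back to target_task."""
    for i in range(current_task, target_task, -1):
        feat = retrospectors[i] @ feat
    return feat

f_N = rng.normal(size=D)           # current representation f_N(x)
f_hat_1 = retrospect(f_N, N, 1)    # approximated \hat{f}_1(x)
logits = heads[1] @ f_hat_1        # predict with the task-1 head w_1
```

Note that predicting a recent task chains few modules, while a distant task chains many, which is why stability under long chaining (as with RFE-B) matters.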


Paper Structure

This paper contains 32 sections, 12 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: At task $t$, the feature extractor $f_{t}$ and classifier head $w_{t}$ are optimized on the dataset $\mathcal{D}^{\text{train}}_t$. During inference for a test sample from task $t$, we forward the input $\boldsymbol{x}\in \mathcal{D}_t^{\text{test}}$ through the feature extractor and classifier head to obtain the logits. After learning all $N$ tasks, the DNN loses performance on task $t$ due to catastrophic forgetting. Therefore, the latent representation $f_{N}(\boldsymbol{x})$ is propagated through a series of retrospector modules $r_N,\ldots, r_{t+1}$ to perform incremental latent rectification and obtain the approximated representations $\hat{f}_{N-1}, \ldots, \hat{f}_{t}$. The logits are then obtained by passing the recovered representation to the respective classifier head.
  • Figure 2: The retrospector module includes a weak auxiliary feature extractor $h_t$, linear mappings $a^f_t, a^h_t, b_t$, and soft gatings $g^f_t, g^h_t$. The joint information from the projected representations from both $f_t$ and $h_t$ is used to compute the gating value for the rectified representation.
  • Figure 3: The TIL accuracy with 1000 exemplars on 10 tasks of S-TinyImg (lighter color is better). The vertical axis represents the task the model has been trained on; the horizontal axis represents the task identity. Each cell value is that task's accuracy. RFE-P demonstrates a forgetting rate comparable to or better than other methods without revisiting distant task samples. RFE-B's performance is more stable under long chaining.
  • Figure 4: The evolving TIL average accuracies of CL methods with 1000 exemplars on 20 tasks of S-TinyImg. RFE-P and RFE-B consistently improve over the baselines.
  • Figure 5: We employ PCA to visualize the rectified latent space after training on task $t$ and predicting task $t'$ ($t' < t$) of S-CIFAR100. By visualizing the original representation ($f_{t'}(\boldsymbol{x})$), the drifted representation ($f_t(\boldsymbol{x})$), and the rectified representation ($\hat{f}_{t'}(\boldsymbol{x})$), we demonstrate RFE's effectiveness. The closer the rectified representation is to the original representation, the better the performance. In the full training objective, we set $\alpha=0$ (no regularization) to clearly visualize catastrophic forgetting and the retrospector module's performance.
  • ...and 1 more figure
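Figure 2 names the components of one retrospector: a weak auxiliary extractor $h_t$, linear mappings $a^f_t, a^h_t, b_t$, and soft gatings $g^f_t, g^h_t$ driven by the joint projected information. The exact wiring is not specified by the caption alone, so the following is a hypothetical sketch of one plausible composition, with all parameters as random stand-ins:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
D = 8  # feature dimension

# Parameter stand-ins for one retrospector: linear mappings and gating weights.
A_f = rng.normal(size=(D, D)) / np.sqrt(D)           # a^f_t
A_h = rng.normal(size=(D, D)) / np.sqrt(D)           # a^h_t
B = rng.normal(size=(D, D)) / np.sqrt(D)             # b_t
G_f = rng.normal(size=(D, 2 * D)) / np.sqrt(2 * D)   # g^f_t
G_h = rng.normal(size=(D, 2 * D)) / np.sqrt(2 * D)   # g^h_t

def weak_extractor(x):
    """Stand-in for the weak auxiliary feature extractor h_t."""
    return np.tanh(x)

def retrospector(f_feat, x):
    """Gate the projected f_t and h_t features into a rectified representation."""
    p_f = A_f @ f_feat                  # projection of the drifted feature f_t(x)
    p_h = A_h @ weak_extractor(x)       # projection of the auxiliary feature h_t(x)
    joint = np.concatenate([p_f, p_h])  # joint information drives the soft gatings
    g_f, g_h = sigmoid(G_f @ joint), sigmoid(G_h @ joint)
    return B @ (g_f * p_f + g_h * p_h)  # rectified representation via b_t

x = rng.normal(size=D)
f_hat = retrospector(rng.normal(size=D), x)  # approximated \hat{f}_{t-1}(x)
```

Since $h_t$ is deliberately weak and the mappings are linear, each module stays lightweight, consistent with the paper's emphasis on minimal added capacity.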