Continual learning via probabilistic exchangeable sequence modelling

Hanwen Xing, Christopher Yau

TL;DR

CL-BRUNO tackles continual learning by casting streams of task data as exchangeable sequences and modelling them with a probabilistic Neural Process built on a conditional Real-NVP bijection $f_\theta$. It combines a C-BRUNO module for modelling $p(\boldsymbol{X}_t|t,\boldsymbol{Y}_t)$ with a likelihood-based update regime and generative replay that yields both distributional and functional regularisation, removing the need to store past samples. The framework supports both task- and class-incremental learning, offering tractable uncertainty quantification for label and task identity and enabling outlier detection. Empirical results on CIFAR-100 and MNIST (plus biomedical datasets in the supplement) show competitive or superior performance to state-of-the-art exemplar-free and generative CL methods, with favorable memory and computation characteristics, making it practical for privacy-conscious, real-world deployment.

Abstract

Continual learning (CL) refers to the ability to continuously learn and accumulate new knowledge while retaining useful information from past experiences. Although numerous CL methods have been proposed in recent years, it is not straightforward to deploy them directly to real-world decision-making problems due to their computational cost and lack of uncertainty quantification. To address these issues, we propose CL-BRUNO, a probabilistic, Neural Process-based CL model that performs scalable and tractable Bayesian update and prediction. Our proposed approach uses deep-generative models to create a unified probabilistic framework capable of handling different types of CL problems such as task- and class-incremental learning, allowing users to integrate information across different CL scenarios using a single model. Our approach is able to prevent catastrophic forgetting through distributional and functional regularisation without the need of retaining any previously seen samples, making it appealing to applications where data privacy or storage capacity is of concern. Experiments show that CL-BRUNO outperforms existing methods on both natural image and biomedical data sets, confirming its effectiveness in real-world applications.

Paper Structure

This paper contains 24 sections, 16 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Schematic illustration of how C-BRUNO learns the sequence distribution $p(\mathcal{X}_t| \mathcal{Y}_t) = \prod_{i=1}^{N} p(X_{i,t}|X_{1:i-1,t}, Y_{1:i,t})$. For each feature vector $X_{i,t}$ in the sequence, C-BRUNO first transforms it to the corresponding latent variable using the label-dependent mapping $\hat{\mathbf{z}}_{i,t} = f(X_{i,t}|t, Y_{i,t})$, then approximates the one-step-ahead conditional $p(X_{i,t}|X_{1:i-1,t}, Y_{1:i,t})$ by $p_t(\hat{\mathbf{z}}_{i,t}|\hat{\mathbf{z}}_{1:i-1,t})\left|\det \frac{\partial \hat{\mathbf{z}}_{i,t}}{\partial X_{i,t}}\right|$. Exchangeability is guaranteed by the specific covariance function in the latent distribution $p(\hat{\mathbf{z}}_{1:N_t})$. In the generation/inference phase, given a label $Y^*$, a new latent variable $\mathbf{z}^*$ is first generated from $p(\cdot|\hat{\mathbf{z}}_{1:N, t})$, a multivariate Gaussian whose mean and covariance depend on the observed sequence, and then transformed to the generated feature vector $X^* = f^{-1}(\mathbf{z}^*|t, Y^*)$ under label $Y^*$.
  • Figure 2: Schematic illustration of TIL in CL-BRUNO. Pseudo datasets are generated from the previous latent predictive distributions $p(\cdot|\hat{\mathbf{z}}_{1:N_t,t})$ and the bijective mapping $f_{old}$. Note that in the TIL phase, the new bijective mapping $f_{new}$ learns to 1) map the new dataset $\mathcal{D}_{T+1}$ to a series of latent variables and compute the corresponding latent predictive $p(\cdot|\hat{\mathbf{z}}_{1:N_{T+1},T+1})$ (i.e. learning from new data) and 2) map the pseudo-datasets $\hat{\mathcal{D}}_t$ back to latent variables that resemble samples drawn from the previous latent predictive distributions $p(\cdot|\hat{\mathbf{z}}_{1:N_t,t})$ (i.e. retaining learnt knowledge).
  • Figure 3: Scatter plots of the first two dimensions of samples in the test set (cross) and samples generated from the trained CL-BRUNO (circle) for each of the 4 tasks.
  • Figure 4: Evolution of the misclassification rate specific to each incremental dataset. Each point represents the misclassification rate for one incremental dataset, evaluated at a specific training step using the incrementally trained CL-BRUNO. Each triangle represents the same quantity given by a CL-BRUNO model that has access to all historical datasets (oracle). (a) PANCAN dataset under a CIL scenario, (b) ICI dataset under a TIL scenario. Note that the ICI dataset consists of tasks with only one class, which leads to zero test error.
  • Figure 5: ICI dataset. (a): Heat map of predicted task identity. Each row corresponds to the categorical task identity distribution (Eq. \ref{eq:taskid}) averaged over test samples from each task. (b): t-SNE (van der Maaten and Hinton, 2008) projection of the pre-processed RNAseq measurements associated with different therapy types in the ICI dataset. (c): Averaged predicted probabilities for different groups of patients under treatment $\texttt{Atezo}$. Patients are split into four groups based on cancer type (Kidney cancer vs Non-kidney cancer) and responsiveness to treatment (Responder vs Non-responder). (d): Averaged predicted probabilities for groups of patients under treatment $\texttt{Nivo}$. Patients are split into four groups in a similar fashion to (c).
  • ...and 1 more figure
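
The Figure 1 caption describes a sequence likelihood built from one-step-ahead Gaussian conditionals in latent space plus a change-of-variables term. The following is a minimal one-dimensional sketch of that idea, not the authors' implementation. It rests on assumptions that are not stated in this excerpt: the latent covariance is taken to be compound-symmetric (variance `v` on the diagonal, a shared covariance `rho` off the diagonal), and a toy label-dependent affine map (hypothetical `shift` and `scale` arrays) stands in for the conditional Real-NVP bijection. All function names are illustrative.

```python
import numpy as np

def predictive(z_obs, mu, v, rho):
    """Mean and variance of p(z_new | z_obs) for an exchangeable Gaussian.

    Assumed covariance: Var(z_i) = v, Cov(z_i, z_j) = rho for i != j.
    """
    n = len(z_obs)
    if n == 0:
        return mu, v
    d = v + (n - 1) * rho
    mean = mu + rho * np.sum(z_obs - mu) / d
    var = v - n * rho**2 / d
    return mean, var

def gauss_logpdf(z, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var)

def sequence_loglik(x, y, mu, v, rho, shift, scale):
    """log p(x_1..x_N | y) as a sum of one-step-ahead conditionals."""
    # Toy label-dependent bijection (stand-in for conditional Real-NVP).
    z = (x - shift[y]) / scale[y]
    log_det = -np.log(scale[y])  # log |dz/dx| for each element
    total = 0.0
    for i in range(len(z)):
        m, s2 = predictive(z[:i], mu, v, rho)
        total += gauss_logpdf(z[i], m, s2) + log_det[i]
    return total

# Hypothetical toy data: 6 scalar features with binary labels.
rng = np.random.default_rng(0)
x = rng.normal(size=6)
y = rng.integers(0, 2, size=6)
shift = np.array([0.0, 1.0])
scale = np.array([1.0, 2.0])

ll = sequence_loglik(x, y, 0.0, 1.0, 0.3, shift, scale)

# Reordering the sequence should leave the likelihood unchanged.
perm = rng.permutation(6)
ll_perm = sequence_loglik(x[perm], y[perm], 0.0, 1.0, 0.3, shift, scale)
print(np.isclose(ll, ll_perm))
```

Under these assumptions the predictive mean and variance have closed forms that need only the running sum and count of the observed latents, so the likelihood is the same for any ordering of the sequence, which is the exchangeability property the caption refers to.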