On Sequential Maximum a Posteriori Inference for Continual Learning

Menghao Waiyan William Zhu, Ercan Engin Kuruoğlu

TL;DR

This work reframes continual learning as sequential maximum a posteriori (MAP) inference, deriving the loss recursion $\mathfrak L_t(\theta)=\mathfrak L_{t-1}(\theta)+\mathfrak l_t(\theta)$. Because past data are unavailable, the previous loss $\mathfrak L_{t-1}(\theta)$ has to be approximated, and the paper proposes two coreset-free approximations. Autodiff Quadratic Consolidation (AQC) uses a full quadratic (Laplace) approximation computed from Hessians, yielding a Hessian-based quadratic term in place of the previous loss, while Neural Consolidation (NC) trains a consolidator network $\kappa(\theta;\phi)$ to approximate the previous loss, giving $\hat{\mathfrak L}_t(\theta)=\lambda\kappa(\theta;\phi)+\mathfrak l_t(\theta)$. Neither method scales with the size of the neural network, so both are studied for classification on top of a fixed pre-trained feature extractor. Experiments on classical (Iris, Wine) and image (MNIST, CIFAR-10, HAM-8, BCN-12) task sequences show that AQC performs consistently well on high-dimensional visual features and approaches joint MAP performance in many cases when a pre-trained feature extractor is used, whereas NC tends to excel on the low-dimensional classical tasks. The results underscore the value of pre-training for continual learning and show that coreset-free strategies can mitigate forgetting in sequential tasks.
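
As a worked sketch of where the recursion comes from, assume (as the Bayesian network in Figure 1 suggests) that the data of different tasks are conditionally independent given $\theta$, and write $\mathfrak L_0(\theta)=-\ln p(\theta)$ and $\mathfrak l_t(\theta)=-\ln p(\bm y_t\mid\bm x_t,\theta)$. These definitions are assumptions made for this illustration; the paper's exact notation and constants may differ.

$$
\begin{aligned}
p(\theta\mid\bm x_{1:t},\bm y_{1:t}) &\propto p(\theta)\prod_{i=1}^{t}p(\bm y_i\mid\bm x_i,\theta),\\
\mathfrak L_t(\theta) &= -\ln p(\theta)-\sum_{i=1}^{t}\ln p(\bm y_i\mid\bm x_i,\theta)
= \mathfrak L_{t-1}(\theta)+\mathfrak l_t(\theta).
\end{aligned}
$$

A generic quadratic (Laplace-style) approximation then replaces $\mathfrak L_{t-1}$ by its second-order Taylor expansion around its minimizer $\theta^*_{t-1}$, where the gradient vanishes:

$$
\mathfrak L_{t-1}(\theta)\approx\mathfrak L_{t-1}(\theta^*_{t-1})+\tfrac12(\theta-\theta^*_{t-1})^\top H_{t-1}(\theta-\theta^*_{t-1}),
\qquad H_{t-1}=\nabla^2\mathfrak L_{t-1}(\theta^*_{t-1}).
$$

The symbols $\theta^*_{t-1}$ and $H_{t-1}$ are introduced here for illustration only; this is the textbook form of such an approximation, and AQC may differ in its details.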

Abstract

We formulate sequential maximum a posteriori inference as a recursion of loss functions and reduce the problem of continual learning to approximating the previous loss function. We then propose two coreset-free methods: autodiff quadratic consolidation, which uses an accurate and full quadratic approximation, and neural consolidation, which uses a neural network approximation. These methods are not scalable with respect to the neural network size, and we study them for classification tasks in combination with a fixed pre-trained feature extractor. We also introduce simple but challenging classical task sequences based on Iris and Wine datasets. We find that neural consolidation performs well in the classical task sequences, where the input dimension is small, while autodiff quadratic consolidation performs consistently well in image task sequences with a fixed pre-trained feature extractor, achieving comparable performance to joint maximum a posteriori training in many cases.
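
The idea that sequential MAP with an accurate quadratic approximation can match joint MAP training can be illustrated on a toy problem. The following is a minimal sketch, not the paper's implementation: it assumes a hypothetical scalar model with a Gaussian prior and unit-variance Gaussian likelihood, so every loss is exactly quadratic, and it uses finite differences as a stand-in for autodiff Hessians. All function names are illustrative.

```python
# Toy sketch of sequential MAP with a quadratic stand-in for the previous loss.
# Assumptions (not from the paper): scalar theta, prior N(0, 1), data y ~ N(theta, 1).

def task_loss(theta, ys):
    """Negative log-likelihood of one task, up to an additive constant."""
    return 0.5 * sum((y - theta) ** 2 for y in ys)

def first_derivative(f, x, eps=1e-3):
    """Central finite difference; stands in for an autodiff gradient."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

def second_derivative(f, x, eps=1e-3):
    """Central finite difference; stands in for an autodiff Hessian."""
    return (f(x + eps) - 2.0 * f(x) + f(x - eps)) / eps ** 2

def minimize(f, x0, steps=20):
    """Newton iterations, adequate for the smooth convex 1-D losses used here."""
    x = x0
    for _ in range(steps):
        x -= first_derivative(f, x) / second_derivative(f, x)
    return x

def sequential_map(tasks, prior_mean=0.0, prior_precision=1.0):
    """Process tasks one at a time, keeping only a center and a curvature."""
    center, curvature = prior_mean, prior_precision
    for ys in tasks:
        # Quadratic stand-in for the previous loss plus the current task loss.
        def approx_loss(theta, c=center, h=curvature, ys=ys):
            return 0.5 * h * (theta - c) ** 2 + task_loss(theta, ys)
        center = minimize(approx_loss, center)
        # New curvature = old curvature + curvature of the current task loss.
        curvature = second_derivative(approx_loss, center)
    return center, curvature

def joint_map(tasks, prior_mean=0.0, prior_precision=1.0):
    """Reference solution that sees the data of all tasks at once."""
    all_ys = [y for ys in tasks for y in ys]
    def joint_loss(theta):
        return 0.5 * prior_precision * (theta - prior_mean) ** 2 + task_loss(theta, all_ys)
    return minimize(joint_loss, prior_mean)

tasks = [[1.0, 1.2, 0.8], [2.0, 2.2], [0.5]]
theta_seq, precision = sequential_map(tasks)
theta_joint = joint_map(tasks)
print(theta_seq, theta_joint, precision)
```

In this toy setting the quadratic approximation is exact, so the sequential and joint estimates coincide (both are 7.7/7 = 1.1, with accumulated precision 7). For non-quadratic losses such as softmax classification, the approximation is no longer exact and the two estimates will generally differ.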

Paper Structure (19 sections, 6 equations, 2 figures, 1 table)


Figures (2)

  • Figure 1: Bayesian network for continual learning. $\bm\theta$ is the collection of parameters of the neural network, $\bm x_{1:t}$ are the inputs and $\bm y_{1:t}$ are the outputs.
  • Figure 2: Visualizations of prediction probabilities for the methods on CI Split 2D Iris. The x-axis is the petal length (cm) and the y-axis is the petal width (cm). The pseudocolor plot shows the prediction probabilities, where the 3 class probabilities are mapped to the red, green and blue values, respectively, and the dots show the observed data. NC performs the best and is better with softmax regression.