On Sequential Maximum a Posteriori Inference for Continual Learning
Menghao Waiyan William Zhu, Ercan Engin Kuruoğlu
TL;DR
This work reframes continual learning as sequential maximum a posteriori (MAP) inference and derives the loss recursion $\mathfrak L_t(\theta)=\mathfrak L_{t-1}(\theta)+\mathfrak l_t(\theta)$, where $\mathfrak l_t$ is the loss on the data of task $t$. Because past data are unavailable, the previous loss $\mathfrak L_{t-1}$ has to be approximated, and the work proposes two coreset-free approximations. Autodiff Quadratic Consolidation (AQC) uses a full quadratic (Laplace) approximation computed from Hessians, yielding a Hessian-based quadratic term that plays the role of a prior, while Neural Consolidation (NC) trains a consolidator network $\kappa(\theta;\phi)$ to approximate the previous loss, giving $\hat{\mathfrak L}_t(\theta)=\lambda\kappa(\theta;\phi)+\mathfrak l_t(\theta)$. Experiments on classical (Iris, Wine) and image (MNIST, CIFAR-10, HAM-8, BCN-12) task sequences show that AQC is robust for high-dimensional visual features and can approach joint MAP performance when a pre-trained feature extractor is used, whereas NC tends to excel on low-dimensional classical tasks. The results underscore the value of pre-training for continual learning and demonstrate practical, coreset-free strategies for mitigating forgetting in sequential tasks.
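The recursion follows from Bayes' rule. As a sketch, assume the task datasets $\mathcal D_1,\dots,\mathcal D_t$ are conditionally independent given $\theta$, and identify $\mathfrak L_t$ with the negative log posterior up to an additive constant. The dataset notation and this identification are assumptions made here for illustration:

$$
\begin{aligned}
p(\theta\mid\mathcal D_{1:t}) &\propto p(\theta\mid\mathcal D_{1:t-1})\,p(\mathcal D_t\mid\theta),\\
\mathfrak L_t(\theta) &= \mathfrak L_{t-1}(\theta)+\mathfrak l_t(\theta),\qquad \mathfrak l_t(\theta)=-\ln p(\mathcal D_t\mid\theta),\quad \mathfrak L_0(\theta)=-\ln p(\theta).
\end{aligned}
$$

Under the same assumptions, a quadratic consolidation step would replace the inaccessible $\mathfrak L_{t-1}$ by its second-order Taylor expansion around the previous MAP estimate $\theta_{t-1}^*$, where the gradient vanishes:

$$
\mathfrak L_{t-1}(\theta)\approx \mathfrak L_{t-1}(\theta_{t-1}^*)+\tfrac12(\theta-\theta_{t-1}^*)^\top H_{t-1}\,(\theta-\theta_{t-1}^*),\qquad H_{t-1}=\nabla^2_\theta\,\mathfrak L_{t-1}(\theta_{t-1}^*).
$$

The constant term does not affect the minimiser, so only $\theta_{t-1}^*$ and $H_{t-1}$ need to be stored between tasks.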
Abstract
We formulate sequential maximum a posteriori inference as a recursion of loss functions and reduce the problem of continual learning to approximating the previous loss function. We then propose two coreset-free methods: autodiff quadratic consolidation, which uses an accurate and full quadratic approximation, and neural consolidation, which uses a neural network approximation. Neither method scales well with neural network size, so we study them for classification tasks in combination with a fixed pre-trained feature extractor. We also introduce simple but challenging classical task sequences based on the Iris and Wine datasets. We find that neural consolidation performs well in the classical task sequences, where the input dimension is small, while autodiff quadratic consolidation performs consistently well in image task sequences with a fixed pre-trained feature extractor, achieving performance comparable to joint maximum a posteriori training in many cases.
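To make the recursion and the quadratic approximation concrete, the following is a minimal sketch on a toy problem and not the authors' implementation. It assumes a scalar linear model $y=\theta x+\text{noise}$ with unit noise variance and a standard normal prior on $\theta$. The function names and the data are hypothetical. Derivatives are written out by hand here, whereas the method described above obtains Hessians by automatic differentiation. Because every loss in this toy model is exactly quadratic, the quadratic consolidation loses nothing and the sequential estimate coincides with joint MAP training; for a nonlinear model it would only be an approximation.

```python
def task_grad_hess(theta, xs, ys):
    """Gradient and Hessian of l_t(theta) = 0.5 * sum((y - theta * x)**2)."""
    grad = sum(-(y - theta * x) * x for x, y in zip(xs, ys))
    hess = sum(x * x for x in xs)
    return grad, hess


def sequential_map(tasks):
    """Recursion L_t = L_{t-1} + l_t, keeping only (centre, curvature)."""
    # L_0(theta) = 0.5 * theta**2 is the negative log prior (up to a constant).
    centre, curvature = 0.0, 1.0
    for xs, ys in tasks:
        # Objective for this task:
        #   0.5 * curvature * (theta - centre)**2 + l_t(theta)
        # It is quadratic, so a single Newton step from `centre` is exact.
        grad, hess = task_grad_hess(centre, xs, ys)
        new_curvature = curvature + hess
        centre = centre - grad / new_curvature
        curvature = new_curvature
        # The task data can now be discarded: (centre, curvature) summarise L_t.
    return centre, curvature


def joint_map(tasks):
    """MAP estimate when all task data are available at once."""
    sxy = sum(x * y for xs, ys in tasks for x, y in zip(xs, ys))
    sxx = sum(x * x for xs, ys in tasks for x in xs)
    return sxy / (1.0 + sxx)


# Three small hypothetical tasks, each a pair (inputs, targets).
tasks = [([1.0, 2.0], [2.1, 3.9]), ([3.0], [6.2]), ([0.5, 1.5], [0.9, 3.1])]
theta_seq, final_curvature = sequential_map(tasks)
theta_joint = joint_map(tasks)
print(theta_seq, theta_joint)
```

The two printed estimates agree up to floating-point error. Storing a full curvature matrix instead of a scalar is what makes this approach costly for large parameter vectors, which is consistent with the scalability caveat stated in the abstract.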
