Revised Regularization for Efficient Continual Learning through Correlation-Based Parameter Update in Bayesian Neural Networks

Sanchar Palit, Biplab Banerjee, Subhasis Chaudhuri

TL;DR

A Bayesian neural network-based continual learning algorithm using Variational Inference that aims to overcome several drawbacks of existing methods and introduces a regularization term specifically targeting the dynamics and population of the mean and variance of the parameters.

Abstract

We propose a Bayesian neural network-based continual learning algorithm using Variational Inference, aiming to overcome several drawbacks of existing methods. Specifically, in continual learning scenarios, storing network parameters at each step to retain knowledge poses challenges. This is compounded by the crucial need to mitigate catastrophic forgetting, particularly given the limited access to past datasets, which complicates maintaining correspondence between network parameters and datasets across all sessions. Current methods using Variational Inference with KL divergence risk catastrophic forgetting during uncertain node updates and coupled disruptions in certain nodes. To address these challenges, we propose the following strategies. To reduce the storage of the dense layer parameters, we propose a parameter distribution learning method that significantly reduces the storage requirements. In the continual learning framework employing variational inference, our study introduces a regularization term that specifically targets the dynamics and population of the mean and variance of the parameters. This term aims to retain the benefits of KL divergence while addressing related challenges. To ensure proper correspondence between network parameters and the data, our method introduces an importance-weighted Evidence Lower Bound term to capture data and parameter correlations. This enables storage of common and distinctive parameter hyperspace bases. The proposed method partitions the parameter space into common and distinctive subspaces, with conditions for effective backward and forward knowledge transfer, elucidating the network-parameter dataset correspondence. The experimental results demonstrate the effectiveness of our method across diverse datasets and various combinations of sequential datasets, yielding superior performance compared to existing approaches.
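For reference, the KL divergence regularizer that the abstract attributes to existing variational continual learning methods has a closed form when both the previous-task posterior and the current posterior are factorized Gaussians. The sketch below assumes that mean-field Gaussian setting, which the abstract does not state explicitly; the function name and arguments are illustrative and not taken from the paper. It shows the standard baseline term only, not the revised regularizer the authors propose.

```python
import numpy as np


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between two factorized Gaussians, summed over parameters.

    q = N(mu_q, sigma_q^2) plays the role of the current-task posterior and
    p = N(mu_p, sigma_p^2) the previous-task posterior acting as the prior.
    (Hypothetical helper, assuming mean-field Gaussians.)
    """
    mu_q = np.asarray(mu_q, dtype=float)
    sigma_q = np.asarray(sigma_q, dtype=float)
    mu_p = np.asarray(mu_p, dtype=float)
    sigma_p = np.asarray(sigma_p, dtype=float)

    # Per-parameter closed form:
    # log(sigma_p / sigma_q) + (sigma_q^2 + (mu_q - mu_p)^2) / (2 sigma_p^2) - 1/2
    per_param = (
        np.log(sigma_p / sigma_q)
        + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
        - 0.5
    )
    return float(per_param.sum())


# Usage: the penalty grows as the mean drifts away from the previous posterior.
penalty = kl_diag_gaussians(mu_q=[0.3, -0.1], sigma_q=[0.5, 0.5],
                            mu_p=[0.0, 0.0], sigma_p=[0.5, 0.5])
```

The term couples the mean shift to the prior variance: a parameter with a large previous variance (an uncertain node) is penalized only weakly for moving, which is the behaviour the abstract identifies as a forgetting risk.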

Paper Structure

This paper contains 11 sections, 13 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: (a) Efficient parameter updating for a convolutional neural network involves updating the model parameters with the revised loss function. (b) Efficient parameter updating for a convolutional neural network involves updating the model parameters with the revised loss function. Following this, the basis for the differentiated subspace is determined using SVD on the representation matrix. Upon establishing correspondence, this basis is then shifted to the common subspace.
  • Figure 2: Illustration of backward knowledge transfer: (a) parameters at the beginning, (b) parameters after training on task 1, (c) parameters after training on task 2. The parameters that remained uncertain after task 1 and were subsequently learned during task 2 help tighten the alignment between the log evidence curve and its surrogate.
  • Figure 3: Sigma at different convolutional layers for CIFAR100. (a) layers 1,2 (b) layers 3,4 (c) layers 5,6. Best viewed when zoomed in.
  • Figure 4: Drift of the means of the weight parameters while training across 10 sessions on CIFAR100 for the first Bayesian convolutional layer, using GTM. The vectors are flattened into one-dimensional vectors for visualization. Note that in each session a new category is incorporated, representing the parameters associated with that session. Best viewed when zoomed in.
  • Figure 5: Drift of the means of the weight parameters while training on split MNIST across 5 sessions at the dense layer $(\mathbb{R}^{m \times k})$, using GTM. The same pattern is followed by all the 1800 tensors $(\mathbb{R}^{m})$ of size 256 $(\mathbb{R}^{k})$.
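
The Figure 1 caption mentions determining a subspace basis by applying SVD to a representation matrix. A minimal sketch of that general technique follows, assuming the basis is taken from the leading left singular vectors and truncated by a cumulative-energy threshold. The threshold rule, the function name `subspace_basis`, and the `energy` parameter are assumptions made for illustration; the summary above does not specify how the authors select the basis.

```python
import numpy as np


def subspace_basis(representation, energy=0.95):
    """Orthonormal basis for the dominant column subspace of a matrix.

    Keeps the smallest number of left singular vectors whose squared
    singular values account for at least `energy` of the total.
    (Hypothetical helper; the truncation rule is an assumption.)
    """
    representation = np.asarray(representation, dtype=float)
    U, S, _ = np.linalg.svd(representation, full_matrices=False)

    total = np.sum(S ** 2)
    if total == 0.0:
        # A zero matrix has no dominant directions.
        return U[:, :0]

    ratios = np.cumsum(S ** 2) / total
    # First index whose cumulative ratio reaches the threshold, plus one.
    k = min(int(np.searchsorted(ratios, energy)) + 1, len(S))
    return U[:, :k]


# Usage: a matrix with singular values 3, 2, 0 keeps two basis vectors.
basis = subspace_basis(np.diag([3.0, 2.0, 0.0]), energy=0.95)
```

The returned columns are orthonormal, so projecting a vector `v` onto the subspace is `basis @ (basis.T @ v)`.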