Low-Rank Filtering and Smoothing for Sequential Deep Learning

Joanna Sliwa, Frank Schneider, Nathanael Bosch, Agustinus Kristiadi, Philipp Hennig

TL;DR

The paper reframes sequential deep learning as Bayesian filtering and smoothing over a weight-state space, enabling principled incorporation of task relationships and backwards knowledge transfer. It introduces LR-LGF, a diagonal plus low-rank approximation of the precision matrix, built on the generalized Gauss–Newton, that makes filtering and smoothing efficient for deep networks. A smoothing extension allows task-specific models to benefit from tasks seen later without accessing their data, which is valuable for privacy-focused applications. Empirical results on CAMELYON and MNIST show reduced catastrophic forgetting and competitive performance, and illustrate how the low-rank budget and the modeled task relationships influence learning.

Abstract

Learning multiple tasks sequentially requires neural networks to balance retaining knowledge, yet being flexible enough to adapt to new tasks. Regularizing network parameters is a common approach, but it rarely incorporates prior knowledge about task relationships, and limits information flow to future tasks only. We propose a Bayesian framework that treats the network's parameters as the state space of a nonlinear Gaussian model, unlocking two key capabilities: (1) A principled way to encode domain knowledge about task relationships, allowing, e.g., control over which layers should adapt between tasks. (2) A novel application of Bayesian smoothing, allowing task-specific models to also incorporate knowledge from models learned later. This does not require direct access to their data, which is crucial, e.g., for privacy-critical applications. These capabilities rely on efficient filtering and smoothing operations, for which we propose diagonal plus low-rank approximations of the precision matrix in the Laplace approximation (LR-LGF). Empirical results demonstrate the efficiency of LR-LGF and the benefits of the unlocked capabilities.
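
The filtering and smoothing view described in the abstract can be sketched in equations. The block below is a minimal sketch, not the paper's exact formulation: it assumes random-walk dynamics on the weights, and the symbols ${\bm{m}}_t$, ${\bm{P}}_t$, ${\bm{H}}_t$, ${\bm{G}}_t$, and the superscripts $-$ (predicted) and $s$ (smoothed) are notation introduced here for illustration. Only ${\bm{\theta}}_t$ and ${\bm{Q}}$ appear in the source.

```latex
% Assumed state-space model: weights drift between tasks with noise covariance Q.
\begin{align*}
  {\bm{\theta}}_t &= {\bm{\theta}}_{t-1} + {\bm{q}}_t,
    \qquad {\bm{q}}_t \sim \mathcal{N}({\bm{0}}, {\bm{Q}}), \\
  \mathcal{D}_t \mid {\bm{\theta}}_t &\sim p(\mathcal{D}_t \mid {\bm{\theta}}_t).
\end{align*}

% Predict: add process noise to the previous filtering posterior.
\begin{align*}
  {\bm{m}}_t^- &= {\bm{m}}_{t-1},
  & {\bm{P}}_t^- &= {\bm{P}}_{t-1} + {\bm{Q}}.
\end{align*}

% Update: regularized training on task t, followed by a Laplace approximation.
% H_t stands for a GGN approximation of the negative log-likelihood Hessian at m_t.
\begin{align*}
  {\bm{m}}_t &= \arg\max_{{\bm{\theta}}}\;
      \log p(\mathcal{D}_t \mid {\bm{\theta}})
      - \tfrac{1}{2} ({\bm{\theta}} - {\bm{m}}_t^-)^\top
        ({\bm{P}}_t^-)^{-1} ({\bm{\theta}} - {\bm{m}}_t^-), \\
  {\bm{P}}_t^{-1} &\approx {\bm{H}}_t + ({\bm{P}}_t^-)^{-1}.
\end{align*}

% Smooth (Rauch--Tung--Striebel form under identity dynamics): no data access needed.
\begin{align*}
  {\bm{G}}_t &= {\bm{P}}_t \, ({\bm{P}}_{t+1}^-)^{-1}, \\
  {\bm{m}}_t^s &= {\bm{m}}_t + {\bm{G}}_t \, ({\bm{m}}_{t+1}^s - {\bm{m}}_{t+1}^-), \\
  {\bm{P}}_t^s &= {\bm{P}}_t + {\bm{G}}_t \, ({\bm{P}}_{t+1}^s - {\bm{P}}_{t+1}^-) \, {\bm{G}}_t^\top.
\end{align*}
```

Under these assumptions the smoothing pass only touches means and covariances of later posteriors, which is consistent with the abstract's point that later tasks' data are never revisited.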

Paper Structure

This paper contains 23 sections, 29 equations, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: An efficient weight-space Laplace–Gaussian filter and smoother for sequential deep learning. We treat the neural network's parameters as a nonlinear Gaussian state-space model and perform efficient inference using diagonal plus low-rank Laplace–Gaussian filtering and smoothing. During the update step, we train the neural network on the current task using the parameter covariance as a regularizer, then approximate the posterior distribution with a diagonal plus low-rank Laplace approximation. The predict step adds noise to the model parameters, where the noise covariance ${\bm{Q}}$ can be used to model the type of shift between tasks. Smoothing allows for training task-specific model parameters ${\bm{\theta}}_t$ that are informed by all tasks, without additional training.
  • Figure 2: The effect of ${\bm{Q}}$ on the average and current task's performance. (Left) Without regularization, we see a significant drop in the average performance across all seen tasks, while the performance on the current task (✖) is strong. (Center) Adding regularization helps boost the average performance across tasks, but to the detriment of the current task (most notably task $t\!=\!5$). However, older tasks suffer much less from catastrophic forgetting. (Right) Additionally using a structured ${\bm{Q}}$ can boost the current task performance, while keeping the same average performance across tasks. See \ref{fig:Q_seeds} in \ref{app:exp_details_results} for a summary of the same experimental results across $8$ random seeds.
  • Figure 3: Smoothing improves the performance on earlier tasks. (Left) Task-wise performance after filtering or smoothing up to task $t$, with shaded regions representing one standard deviation across $8$ seeds. (Right) The smoother consistently improves performance by incorporating information from all tasks without accessing their data. The thick line is the mean improvement of the smoothed over the filtered model, and the thin lines show the improvement for each seed.
  • Figure 4: The low-rank approximation across training. (Top) Our rank $k\!=\!10$ approximation to the precision matrix across tasks on PermutedMNIST (see \ref{fig:eigs_20} for $k\!=\!20$). (Bottom) Histograms of the eigenvalues of the approximation's low-rank part. With growing $t$, the eigenvalues increase in magnitude, indicating larger certainty and less flexibility.
  • Figure 5: Examples of the input data from Gradual CAMELYON. (Top) To adjust the brightness, we apply a shift $x_t = x + \Delta_t$. (Bottom) Next, we normalize via $(x_t - \mu_X)/\sigma_X$.
  • ...and 5 more figures
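
The captions above refer to a diagonal plus low-rank approximation of the precision matrix with rank $k$. As a rough illustration of why this structure is cheap to work with, the following NumPy sketch solves a linear system with a precision of the assumed form `diag(d) + U @ U.T` via the Woodbury identity, without ever forming the dense matrix. The function name `dlr_solve`, the parameterization, and the toy sizes are assumptions made for this example, not the paper's implementation.

```python
import numpy as np


def dlr_solve(d, U, v):
    """Solve (diag(d) + U @ U.T) x = v using the Woodbury identity.

    d: (n,) positive diagonal entries (hypothetical parameterization)
    U: (n, k) low-rank factor with k << n
    v: (n,) right-hand side
    Cost is O(n k^2) instead of O(n^3) for a dense solve.
    """
    Dinv_v = v / d                       # D^{-1} v
    Dinv_U = U / d[:, None]              # D^{-1} U
    k = U.shape[1]
    small = np.eye(k) + U.T @ Dinv_U     # k x k system: I + U^T D^{-1} U
    correction = Dinv_U @ np.linalg.solve(small, U.T @ Dinv_v)
    return Dinv_v - correction


# Toy problem: n parameters, rank-k correction (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
n, k = 500, 10
d = rng.uniform(0.5, 2.0, size=n)
U = rng.normal(size=(n, k))
v = rng.normal(size=n)

x = dlr_solve(d, U, v)

# Dense reference, only feasible here because n is small.
dense = np.linalg.solve(np.diag(d) + U @ U.T, v)
print("matches dense solve:", np.allclose(x, dense))
```

Applying the inverse precision to a vector in this way is the kind of operation a filter or smoother needs repeatedly; with $n$ in the millions, only the $k \times k$ solve and a few $n \times k$ products remain tractable.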