Function-space Parameterization of Neural Networks for Sequential Learning

Aidan Scannell; Riccardo Mereu; Paul Chang; Ella Tamir; Joni Pajarinen; Arno Solin

TL;DR

This work introduces the Sparse Function-space Representation (sfr), a sparse GP in dual parameterization that is derived from a trained neural network and addresses sequential learning challenges such as forgetting and incorporating new data without retraining. By linearizing the NN around its MAP weights and using a dual GP parameterization with inducing points, sfr provides scalable uncertainty quantification while summarizing information from all the training data. It enables continual learning through a function-space regularizer and supports fast incorporation of new data via dual updates, with demonstrated benefits in supervised learning, out-of-distribution detection, and model-based RL. Overall, sfr combines the scalability of neural networks with GP-style uncertainty for large-scale sequential learning and decision-making tasks. It targets settings where traditional GPs struggle, such as image inputs and data sets with millions of points, and its uncertainty estimates can guide exploration and support continual learning.

Abstract

Sequential learning paradigms pose challenges for gradient-based deep learning due to difficulties incorporating new data and retaining prior knowledge. While Gaussian processes elegantly tackle these problems, they struggle with scalability and handling rich inputs, such as images. To address these issues, we introduce a technique that converts neural networks from weight space to function space, through a dual parameterization. Our parameterization offers: (i) a way to scale function-space methods to large data sets via sparsification, (ii) retention of prior knowledge when access to past data is limited, and (iii) a mechanism to incorporate new data without retraining. Our experiments demonstrate that we can retain knowledge in continual learning and incorporate new data efficiently. We further show its strengths in uncertainty quantification and guiding exploration in model-based RL. Further information and code are available on the project website.
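
The mechanics described above (linearize around the MAP weights, form a kernel, summarize the data at inducing points, then add new data without retraining) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the tiny MLP, the Gaussian likelihood, the names `jacobian`, `kernel`, `sparse_duals` and `predict`, the prior precision `delta`, and the noise variance are all hypothetical, and random weights stand in for trained MAP weights. The predictive equations follow one common form of the sparse dual-parameter posterior and are assumed here rather than taken from this page.

```python
import numpy as np

H = 8  # hidden units (illustrative choice)

def unpack(w):
    # Weight vector layout (an assumption of this sketch): [W1, b1, w2]
    return w[:H], w[H:2 * H], w[2 * H:]

def mlp(w, x):
    """Scalar-output MLP on 1-D inputs x of shape (N,)."""
    W1, b1, w2 = unpack(w)
    return np.tanh(np.outer(x, W1) + b1) @ w2

def jacobian(w, x):
    """Analytic Jacobian d f_w(x) / d w, shape (N, 3H)."""
    W1, b1, w2 = unpack(w)
    h = np.tanh(np.outer(x, W1) + b1)
    dpre = (1.0 - h ** 2) * w2  # gradient w.r.t. hidden pre-activations
    return np.concatenate([dpre * x[:, None], dpre, h], axis=1)

def kernel(w, xa, xb, delta=1.0):
    """Linearization kernel J(xa) J(xb)^T / delta (delta: assumed prior precision)."""
    return jacobian(w, xa) @ jacobian(w, xb).T / delta

def sparse_duals(w, z, x, y, noise_var=0.1):
    """Per-point Gaussian-likelihood duals, projected onto inducing inputs z."""
    f = mlp(w, x)
    alpha = (y - f) / noise_var              # first derivative of log p(y | f)
    beta = np.full_like(f, 1.0 / noise_var)  # negative second derivative
    Kzx = kernel(w, z, x)
    return Kzx @ alpha, (Kzx * beta) @ Kzx.T

def predict(w, z, alpha_u, B_u, xs, jitter=1e-6):
    """Predictive mean and variance from the sparse dual parameters."""
    Kzz = kernel(w, z, z) + jitter * np.eye(len(z))
    Kzs = kernel(w, z, xs)
    mean = Kzs.T @ np.linalg.solve(Kzz, alpha_u)
    A = np.linalg.inv(Kzz) - np.linalg.inv(Kzz + B_u)
    var = np.diag(kernel(w, xs, xs)) - np.sum(Kzs * (A @ Kzs), axis=0)
    return mean, var

rng = np.random.default_rng(0)
w_map = rng.normal(size=3 * H)            # stand-in for trained MAP weights
z = np.linspace(-2.0, 2.0, 5)             # inducing inputs
x_old = rng.uniform(-2.0, 0.0, size=30)
y_old = np.sin(x_old)
alpha_u, B_u = sparse_duals(w_map, z, x_old, y_old)

# New data arrive: add their contributions; no network retraining involved.
x_new = rng.uniform(0.0, 2.0, size=20)
y_new = np.sin(x_new)
d_alpha, d_B = sparse_duals(w_map, z, x_new, y_new)
alpha_u, B_u = alpha_u + d_alpha, B_u + d_B

mean, var = predict(w_map, z, alpha_u, B_u, np.linspace(-3.0, 3.0, 7))
```

Because the sparse dual parameters are sums over data points, updating them with a new batch gives the same result as recomputing them on the combined data, which is what makes the update cheap in this sketch.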

Paper Structure (67 sections, 49 equations, 8 figures, 6 tables, 2 algorithms)

Introduction
Related work
Background
BNNs
MAP
Laplace approximation
GPs
Sparse GPs
sfr: Sparse function-space representation of NNs
Linear model for dual updates
Dual parameters from NN
Sparsification via dual parameters
sfr for sequential learning
Continual learning
Incorporating new data without retraining
...and 52 more sections

Figures (8)

Figure 1: Regression example with an MLP: Left: Predictions from the trained neural network. Middle: Our approach summarizes all the training data at the inducing points. The model captures the predictive mean and uncertainty, and (right) incorporates new data without retraining the model.
Figure 2: sfr overview: We linearize the trained NN around the MAP weights $\bm{w}^{*}$ and interpret it in function space via a kernel formulation $\kappa(\cdot,\cdot)$ (\ref{eq:weight_func}). In contrast to previous approaches, we perform a Laplace approximation on the function-space objective (\ref{eq:laplace}). This leads to sfr's dual parameterization, which scales to large data sets (\ref{eq:dual_sparse_post}) and incorporates new data efficiently (\ref{eq:fast-updates}).
Figure 3: Effective sparsification: Convergence of NLPD (mean$\pm$std over 5 seeds) with the number of inducing points $M$ on classification tasks: sfr vs. GP subset. sfr converges quickly in all cases, showing the benefit of summarizing all the data in a sparse model.
Figure 4: OOD detection with CNN: Histograms showing each method's predictive entropy on in-distribution (ID) data (FMNIST, blue), where lower is better, and on OOD data (MNIST, red), where higher is better.
Figure A5: Uncertainty quantification for binary classification. We convert the trained neural network (left) to a sparse GP model that summarizes all data onto a sparse set of inducing points (middle). This gives behaviour similar to running full Hamiltonian Monte Carlo (HMC) on the original neural network model weights (right). Marginal uncertainty is depicted by colour intensity.
...and 3 more figures
