Continual learning with the neural tangent ensemble
Ari S. Benjamin, Christian Pehle, Kyle Daruwalla
TL;DR
This work reframes neural networks as Bayesian ensembles of neural tangent experts (NTEs), enabling continual learning without forgetting by posterior weighting over fixed tangent components in the lazy regime. It shows that a first-order Taylor expansion around a seed point makes the network an ensemble of $N$ classifiers, each contributing a probability distribution, with learning corresponding to updating ensemble weights via a posterior that is nearly equivalent to stochastic gradient descent on the initialization. In finite-width (rich-regime) networks, the tangent experts become adaptive, but the authors derive a practical NTE rule using current gradients that remains effective when networks scale, and they demonstrate that momentum harms forgetting while width can improve retention under certain optimizers. The framework provides a principled Bayesian interpretation of forgetting and suggests concrete directions to mitigate it, including using near-initialization dynamics and carefully chosen optimizers, with broad implications for understanding and improving continual learning in deep models.
Abstract
A natural strategy for continual learning is to weigh a Bayesian ensemble of fixed functions. This suggests that if a (single) neural network could be interpreted as an ensemble, one could design effective algorithms that learn without forgetting. To realize this possibility, we observe that a neural network classifier with N parameters can be interpreted as a weighted ensemble of N classifiers, and that in the lazy regime limit these classifiers are fixed throughout learning. We call these classifiers the neural tangent experts and show they output valid probability distributions over the labels. We then derive the likelihood and posterior probability of each expert given past data. Surprisingly, the posterior updates for these experts are equivalent to a scaled and projected form of stochastic gradient descent (SGD) over the network weights. Away from the lazy regime, networks can be seen as ensembles of adaptive experts which improve over time. These results offer a new interpretation of neural networks as Bayesian ensembles of experts, providing a principled framework for understanding and mitigating catastrophic forgetting in continual learning settings.
