Mathematics of Continual Learning

Liangzu Peng, René Vidal

TL;DR

This paper builds a principled bridge between continual learning and adaptive filtering by mapping four classic adaptive-filtering algorithms onto continual-learning scenarios: least mean squares (LMS), the affine projection algorithm (APA), recursive least squares (RLS), and the Kalman filter (KF). It shows that LMS can be viewed as an online, memoryless learner with exponential convergence on linear tasks, while APA integrates past data via constraint projections. RLS and KF generalize these ideas to weighted past data and state-space task relationships, with Rauch-Tung-Striebel (RTS) smoothing enabling positive backward transfer. The authors further connect these methods to ideal continual learning (ICL) and gradient-projection approaches, and extend the insights to layer-wise and linearized nonlinear models, providing a cohesive mathematical foundation for understanding forgetting, task relationships, and continual adaptation. Overall, the work suggests a rigorous basis for designing continual-learning algorithms, including memory-based, regularization-based, and expansion-based strategies, and points to future directions in nonlinear extensions and kernel- or subspace-tracking analogies.

Abstract

Continual learning is an emerging subject in machine learning that aims to solve multiple tasks presented sequentially to the learner without forgetting previously learned tasks. Recently, many deep-learning-based approaches have been proposed for continual learning; however, the mathematical foundations behind existing continual learning methods remain underdeveloped. On the other hand, adaptive filtering is a classic subject in signal processing with a rich history of mathematically principled methods. However, its role in understanding the foundations of continual learning has been underappreciated. In this tutorial, we review the basic principles behind both continual learning and adaptive filtering, and present a comparative analysis that highlights multiple connections between them. These connections allow us to enhance the mathematical foundations of continual learning based on existing results for adaptive filtering, extend adaptive filtering insights using existing continual learning methods, and discuss a few research directions for continual learning suggested by the historical developments in adaptive filtering.

Paper Structure

This paper contains 9 sections, 13 theorems, 52 equations, 8 figures, 1 table.

Key Result

Theorem 1

Let $\{\theta^t\}_{t=0}^T$ be the iterates of eq:LMS with $\theta^0=0$, $\gamma\in(0,2)$. Assume $x_t$'s are independent and identically distributed, drawn according to some distribution on the sphere $\{x\in \mathbb{R}^d: \| x\|_2 = 1\}$. Write $\Sigma_x:= \mathbb{E}[x_tx_t^\top]$, and assume $\Sig
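The setting of Theorem 1 can be illustrated numerically. The update eq:LMS is not reproduced in this excerpt, so the sketch below assumes the standard LMS step $\theta^t = \theta^{t-1} + \gamma\,(y_t - x_t^\top\theta^{t-1})\,x_t$, and it further assumes noiseless labels $y_t = x_t^\top\theta^*$, which the truncated statement does not confirm. The function name `lms`, the ground truth `theta_star`, and the problem sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

def lms(X, y, gamma=1.0):
    """One pass of (assumed standard) LMS over the rows of X, starting from theta^0 = 0."""
    theta = np.zeros(X.shape[1])
    for x_t, y_t in zip(X, y):
        # Gradient step on the single-sample loss 0.5 * (y_t - x_t^T theta)^2.
        theta = theta + gamma * (y_t - x_t @ theta) * x_t
    return theta

rng = np.random.default_rng(0)
d, T = 5, 2000                                   # hypothetical dimension and horizon
X = rng.standard_normal((T, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # i.i.d. samples on the unit sphere
theta_star = rng.standard_normal(d)              # hypothetical ground-truth model
y = X @ theta_star                               # noiseless linear labels (assumption)

theta_T = lms(X, y, gamma=1.0)                   # any gamma in (0, 2) is admissible
err = np.linalg.norm(theta_T - theta_star)
print(f"final error: {err:.2e}")
```

With unit-norm inputs and noiseless labels, each step shrinks the squared error $\|\theta^t-\theta^*\|_2^2$ by $\gamma(2-\gamma)\,(x_t^\top(\theta^{t-1}-\theta^*))^2$, which is nonnegative exactly when $\gamma\in(0,2)$. This is consistent with the stepsize range in the theorem.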

Figures (8)

  • Figure 1: Visualization of a $4\times 4$ error matrix where $\epsilon_{ij}$ is the error of model $\theta^j$ on task $i$, as defined in eq:err-mat. Online optimization considers errors $\epsilon_{ij}$ with $j=i-1$. Finetuning considers errors $\epsilon_{ii}$ on the current task. Continual learning considers errors on both the current and previous tasks.
  • Figure 2: Convergence of $z^t:=\theta^t - \theta^*$ to the intersection $0$ with the constant stepsize $\gamma_t=1$ of theorem:ap (dotted red arrows) or with the alternating stepsize of theorem:opt (dashed blue arrows).
  • Figure 3: Pictorial proof of theorem:APA-obj-t->0 in $\mathbb{R}^3$. eq:APA2 projects $z^0$ onto $\textnormal{Span}(x_1)^\perp$ and then onto $\textnormal{Span}(x_1,x_2)^\perp$ (red arrows). eq:min-norm projects $z^0$ directly onto $\textnormal{Span}(x_1,x_2)^\perp$ (blue arrow). They reach the same point.
  • Figure 4: eq:GP ensures that, for any layer $\ell$ and any $t>i$, the output features $f_{\theta^{t}}^{\ell} (x_i)$ are invariant, even though $\theta^t$ changes with $t$. This is indicated by the same color in each row and proved in theorem:GP.
  • Figure 5: For the two tasks with data $(x_1,y_1)=(2,4)$ and $(x_2,y_2)=(1,1)$, the respective best models are $\theta_1^*=2$ and $\theta_2^*=1$ (gray arrows). eq:RLS-AF outputs a model that averages $\theta_1^*$ and $\theta_2^*$ (blue arrows).
  • ...and 3 more figures
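Two of the captions above describe facts that are easy to check numerically; the sketch below does so under stated assumptions. For Figure 3 it uses random vectors in $\mathbb{R}^3$ and the helper `proj_orth_complement` (a hypothetical name) to compare the two-step projection with the direct one. For Figure 5, eq:RLS-AF is not shown in this excerpt, so the code assumes it reduces to plain unweighted least squares over both tasks, with no forgetting factor or regularization.

```python
import numpy as np

def proj_orth_complement(A):
    """Orthogonal projector onto Span(columns of A)^perp (assumes full column rank)."""
    Q, _ = np.linalg.qr(A)                 # orthonormal basis of Span(A)
    return np.eye(A.shape[0]) - Q @ Q.T

# Figure 3: sequential vs. direct projection in R^3 (random illustrative vectors).
rng = np.random.default_rng(0)
x1, x2, z0 = rng.standard_normal((3, 3))
P1 = proj_orth_complement(x1[:, None])                   # onto Span(x1)^perp
P12 = proj_orth_complement(np.column_stack([x1, x2]))    # onto Span(x1, x2)^perp
sequential = P12 @ (P1 @ z0)               # red arrows: two successive projections
direct = P12 @ z0                          # blue arrow: one projection
same_point = np.allclose(sequential, direct)

# Figure 5: two scalar tasks (x, y) = (2, 4) and (1, 1).
x = np.array([2.0, 1.0])
y = np.array([4.0, 1.0])
theta_per_task = y / x                     # best models per task: 2 and 1
theta_joint = (x @ y) / (x @ x)            # least squares over both tasks (assumed form)
print(same_point, theta_per_task, theta_joint)
```

The two projections agree because $\textnormal{Span}(x_1,x_2)^\perp$ is contained in $\textnormal{Span}(x_1)^\perp$. Under the least-squares assumption, the joint solution is $9/5 = 1.8$, a weighted average of $\theta_1^*=2$ and $\theta_2^*=1$ with weights $x_1^2=4$ and $x_2^2=1$.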

Theorems & Definitions (32)

  • Example 1: Continual Learning of Large Language Models
  • Theorem 1: Section 5.5.1 Theodoridis-2020
  • proof
  • Definition 1
  • Theorem 2: Theorems 8, 10 & Lemma 9 Evron-COLT2022
  • proof
  • Theorem 3
  • proof
  • Theorem 4
  • proof
  • ...and 22 more