Knowledge Adaptation as Posterior Correction

Mohammad Emtiyaz Khan

TL;DR

The paper introduces posterior correction as a unifying framework for rapid knowledge adaptation across continual learning, unlearning, model merging, and federated learning. By recasting adaptation as updating old posterior approximations via a correction term derived from the Bayesian Learning Rule, it shows that many existing methods are special cases of this principle and that richer posteriors reduce the required corrections. The work provides a spectrum of concrete instantiations (isotropic, diagonal, and full Gaussian posteriors) and connects regularization, prediction matching, and influence estimation to posterior-correction terms, including Memory Replay and K-priors. Together, these insights offer a principled path to design faster, more reliable, and scalable adaptive algorithms for sequential and distributed learning tasks.

Abstract

Adaptation is the holy grail of intelligence, but even the best AI models lack the adaptability of toddlers. In spite of great progress, little is known about the mechanisms by which machines can learn to adapt as fast as humans and animals. Here, we cast adaptation as 'correction' of old posteriors and show that a wide variety of existing adaptation methods follow this very principle, including those used for continual learning, federated learning, unlearning, and model merging. In all these settings, more accurate posteriors often lead to smaller corrections and can enable faster adaptation. Posterior correction is derived by using the dual representation of the Bayesian Learning Rule of Khan and Rue (2023), where the interference between the old representation and new information is quantified by using the natural-gradient mismatch. We present many examples demonstrating how machines can learn to adapt quickly by using posterior correction.

Paper Structure

This paper contains 42 sections, 1 theorem, 114 equations, 9 figures, 1 table.

Key Result

Theorem 1

For the $q^{\text{iso}}$ family, eq:2stagePoCoNatparamForm reduces to eq:ama_update if we approximate $\mathbb{E}_{q_i}[\ell_i] \approx \ell_i(\mathbf{m}_i)$.

Figures (9)

  • Figure 1: Four popular scenarios for adaptation of model parameters $\boldsymbol{\theta}$. Black arrows indicate the flow of adapted knowledge, while the gray arrows indicate pre-trained knowledge. Continual learning adapts $\boldsymbol{\theta}_t$ to $\boldsymbol{\theta}_{t+1}$ to include new data $\mathcal{D}_{t+1}$. Unlearning aims to estimate the model after removing a specific data set $\mathcal{D}_i$. Model merging attempts to improve a pre-trained base model $\boldsymbol{\theta}_0$ by merging back the fine-tuned models $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$. Finally, federated learning aims to obtain a joint model $\boldsymbol{\theta}_\text{jnt}$ by using locally trained models.
  • Figure 2: We illustrate the dual form given in eq:qdualform on a 1-D binary classification example (circle vs square in the leftmost panel). The classifier is simply a threshold $\theta$, fit via a logistic likelihood (pink sigmoid). We use a Gaussian $q_t(\theta)$ with mean $m_t$, shown as a vertical dashed line. The logistic losses for the three examples are shown in the top row of the middle panel (transparent lines) along with their quadratic sites (solid lines). The bottom row displays likelihoods. The rightmost panel applies eq:qdualform to form $q_t$. The example farthest from the classifier has the smallest curvature and contributes least to $q_t$. The example is discussed in detail in app:oneDex, along with a general case of logistic regression in app:logreg.
  • Figure 3: The left panel compares the site $\hat{\ell}^{\text{iso}}_{i|t}$ to the first-order Taylor expansion of $\ell_i$ at $\mathbf{m}_t$. The right panel compares $\hat{\ell}^{\text{full}}_{i|t}$ to the second-order Taylor expansion. The sites use expectations of the gradients and Hessians over $q_t$ and capture more global information around $\mathbf{m}_t$.
  • Figure 4: Visualization of corrections as prediction mismatches for linear regression with the $q^{\text{iso}}$ family. The old model is trained on the gray 'o' points, and the new model additionally includes the black '$\times$' points. Corrections are mismatches over old examples (dashed, vertical red lines). The right figure shows an additional, slightly worse model (in blue) with larger mismatches. The red model has a smaller correction because it is closer to the black model, which also supports the intuition that smaller corrections imply faster adaptation.
  • Figure 5: One-dimensional binary classification, with its likelihood and loss function.
  • ...and 4 more figures
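
The prediction-mismatch reading in the Figure 4 caption can be sketched numerically. The snippet below is only an illustration under stated assumptions, not the paper's experiment: the synthetic data, the small ridge term `delta`, the helper names (`fit_ridge`, `prediction_mismatch`, `make_data`) and the perturbation used to build the "slightly worse" model are all hypothetical. It relies on one standard fact: for the squared loss $\ell_i(\boldsymbol{\theta}) = \tfrac{1}{2}(y_i - \mathbf{x}_i^\top \boldsymbol{\theta})^2$, the gradient difference between two parameter vectors on example $i$ equals $\mathbf{x}_i$ times the difference of their predictions.

```python
import numpy as np


def fit_ridge(X, y, delta=1e-3):
    """Least-squares fit with a small ridge term (delta is an assumed value)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + delta * np.eye(d), X.T @ y)


def prediction_mismatch(X_old, theta_candidate, theta_new):
    """Per-example prediction difference on the OLD inputs."""
    return X_old @ theta_candidate - X_old @ theta_new


rng = np.random.default_rng(0)


def make_data(n, lo, hi):
    """Synthetic 1-D regression data with a bias feature (assumed setup)."""
    x = rng.uniform(lo, hi, size=n)
    X = np.column_stack([np.ones(n), x])
    y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=n)
    return X, y


# Old data (gray 'o' in the caption) and additional data (black 'x').
X_old, y_old = make_data(20, -1.0, 1.0)
X_add, y_add = make_data(5, 1.0, 2.0)

theta_old = fit_ridge(X_old, y_old)  # old model
theta_new = fit_ridge(np.vstack([X_old, X_add]),
                      np.concatenate([y_old, y_add]))  # new model

# Hypothetical "slightly worse" model: the old model with a hand-picked shift.
theta_worse = theta_old + np.array([0.5, -0.5])

mis_old = prediction_mismatch(X_old, theta_old, theta_new)
mis_worse = prediction_mismatch(X_old, theta_worse, theta_new)

# Squared-loss gradient difference on old examples equals x_i * mismatch_i.
grad_old = X_old * (X_old @ theta_old - y_old)[:, None]
grad_new = X_old * (X_old @ theta_new - y_old)[:, None]
grad_diff = grad_old - grad_new

print("total squared mismatch, old model:  ", float(np.sum(mis_old ** 2)))
print("total squared mismatch, worse model:", float(np.sum(mis_worse ** 2)))
```

In this toy setup the old model's predictions on the old inputs stay close to the new model's, while the perturbed model's do not, which mirrors the caption's point that a model closer to the new solution needs a smaller correction.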

Theorems & Definitions (1)

  • Theorem 1