Why the Brain Consolidates: Predictive Forgetting for Optimal Generalisation

Zafeirios Fountas, Adnan Oomerjee, Haitham Bou-Ammar, Jun Wang, Neil Burgess

Abstract

Standard accounts of memory consolidation emphasise the stabilisation of stored representations, but struggle to explain representational drift, semanticisation, or the necessity of offline replay. Here we propose that high-capacity neocortical networks optimise stored representations for generalisation by reducing complexity via predictive forgetting, i.e. the selective retention of experienced information that predicts future outcomes or experience. We show that predictive forgetting formally improves information-theoretic generalisation bounds on stored representations. Under high-fidelity encoding constraints, such compression is generally unattainable in a single pass; high-capacity networks therefore benefit from temporally separated, iterative refinement of stored traces without re-accessing sensory input. We demonstrate this capacity dependence with simulations in autoencoder-based neocortical models, biologically plausible predictive coding circuits, and Transformer-based language models, and derive quantitative predictions for consolidation-dependent changes in neural representational geometry. These results identify a computational role for off-line consolidation beyond stabilisation, showing that outcome-conditioned compression optimises the retention-generalisation trade-off.
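As a rough guide to what "information-theoretic generalisation bounds" can look like, the following is a standard result of this family (due to Xu and Raginsky), offered only as an illustration; the paper's own bound is not reproduced in this excerpt and may take a different form. The symbols $W$ (learned hypothesis), $n$ (number of training examples) and $\sigma$ (sub-Gaussian parameter of the loss) are assumptions introduced here; $S$, $Z$ and $\Omega$ follow the usage in the Figure 1 caption.

$$
\left| \mathbb{E}\big[\mathrm{gen}(S, W)\big] \right| \;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I(S; W)}
$$

If the readout parameters depend on the training data only through the stored representation, so that $S \to Z \to \Omega$ is a Markov chain, the data-processing inequality gives

$$
I(S; \Omega) \;\le\; I(S; Z),
$$

so, under these assumptions, any procedure that lowers the information the stored representation retains about the training inputs also lowers an upper bound on the expected generalisation gap of the readout.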

Paper Structure (45 sections, 16 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Consolidation as predictive forgetting. a, Conceptual illustration using visual classification. Initial encoding (distributions $Z_0$; lowercase variables denote specific samples) preserves both diagnostic features (cat-specific: red; dog-specific: blue; shared: yellow) and task-irrelevant details (background, lighting: green) to minimise sensory prediction error. Iterative consolidation ($Z_1 \to Z_5$) progressively discards noise (shared and task-irrelevant features) while sharpening category-predictive features. b, Online Learning (Wake) acts as standard feedforward learning. The encoder module (parameters $\Phi$) maps inputs $X$ to high-fidelity representations $Z$ to predict targets $Y$. Training objectives that maximise task performance can drive both mutual information terms $I(X;Z)$ and $I(Z;Y)$ upwards; however, a high-capacity readout (parameters $\Omega$) minimises error by memorising the input-specific noise retained in $Z$, leading to overfitting. c, Consolidation through iterative latent refinement (modelled as an offline "sleep" phase, cf. the wake-sleep algorithm; Hinton et al., 1995). Augmenting the encoder with an offline consolidator $\Psi$ (applied $N$ times) enables a different optimisation: $I(X;Z)$ decreases as $I(Z;Y)$ is maintained (or increases). By actively reducing input dependence, this process enforces an information bottleneck on the downstream readout $\Omega$ (formally constraining the Markov chain $S \to Z \to \Omega$), physically preventing the overfitting that plagues high-capacity single-pass systems and thus tightening the generalisation bound (Equation \ref{eq:ib-bound}). Note that this schematic depicts the consolidation of the neocortical representation used for generalisation; the hippocampal system may additionally retain veridical episodic details not captured by the compressed code.
  • Extended Data Figure 1: The fidelity-generalisation frontier. To rigorously test whether the benefits of consolidation could be replicated by simply increasing regularisation during online training, we compared our Offline Replay model (Purple Star) against two single-pass baselines on the MNIST dataset. All models utilised the same convolutional architecture ($d=64$) described in Section \ref{sec:ae-consolidation} to ensure a fair comparison of capacity. First, we trained an online agent with a Variational Information Bottleneck (VIB) objective (Gray circles), sweeping the regularisation strength $\beta \in \{10^{-4}, \dots, 10^0\}$ and performing a fine-grained sweep around the optimal region ($\beta \in \{0.08, \dots, 0.11\}$) to identify maximum performance. Second, we trained an online agent with dropout probabilities $p \in \{0.2, \dots, 0.8\}$ (Blue squares). The baselines define a convex "fidelity-generalisation frontier" (dashed lines), representing the unavoidable trade-off between minimising the gap and maintaining accuracy during single-pass learning. The Offline Replay model lies beyond this frontier in the top-left quadrant. Data points represent mean $\pm$ standard deviation across $n=50$ independent seeds. This confirms that iterative offline refinement offers a computational advantage that cannot be recovered by online regularisation alone.
  • Figure 2: Iterative refinement tightens the generalisation bound. a, Architecture comprising a frozen convolutional encoder, an iterative latent refiner, and a task-dependent readout network. b-c, Effect of consolidation steps on classification accuracy (b) and the generalisation gap $\Delta$ (c) across five image classification benchmarks (MNIST, Fashion-MNIST, EMNIST, CIFAR-10, SVHN). Refinement consistently improves performance while shrinking $\Delta$. Online-only regularisation baselines defining the fidelity-generalisation frontier for MNIST are shown in Extended Data Fig. \ref{fig:s1}. d, Information-theoretic validation (information measured in nats, i.e. natural units). We computed proxies for the mutual information terms in Equation \ref{eq:ib-bound}. As predicted, consolidation reduces superfluous input dependence $I(X;Z_t)$ (gray) while increasing task-relevant information $I(Y;Z_t)$ (blue), confirming that the system implements predictive forgetting.
  • Figure 3: The Bidirectional Consolidation Mechanism. a, Wake Perception: Sensory input $x$ is clamped (Lock icon), driving bottom-up inference. The network balances sensory fidelity against internal priors, resulting in a high-fidelity representation where input information $I(x;z)$ is high (Orange). b, Sleep Consolidation: Sensory units are unclamped. A stored memory trace $m$ (representing stable synaptic weights or buffer entries) is retrieved to initialise the transient neural activity $z$. The generative model projects this trace top-down to create a "dream" $\hat{x}$ (Cloud icon), which then drives a second pass of precision-weighted inference. The strong homeostatic prior actively minimises representational cost ($I(x;z) \downarrow$, Red), effectively denoising the trace while preserving task-relevant information ($I(z;y)$, Blue). c, Qualitative Results: Comparison of noisy wake inputs (Left) vs. consolidated sleep replays (Right) across four datasets. The generative loop filters out high-frequency sensory noise, extracting the semantic gist.
  • Figure 4: Consolidation resolves the capacity-generalisation trade-off. a, Generalisation Gap ($\Delta$) vs. Representational Capacity ($d$) on CIFAR-10. In low-capacity regimes ($d < 64$), architectural bottlenecks force such strong compression that models underfit training noise, resulting in a negligible or negative gap (driven by regularisation). However, as capacity scales ($d \to 1024$), the Online agent (Orange) suffers catastrophic overfitting, memorising sensory noise. Offline Replay (Purple) drastically reduces this gap, allowing the system to scale to high capacities with a significantly smaller generalisation penalty. b, Test Accuracy. In low-capacity networks, replay provides minimal benefit as the architecture itself enforces compression. In high-capacity regimes ($d \ge 256$), characteristic of mammalian neocortex, replay translates into a net performance gain. The Replay agent (Purple) outperforms the Online agent because the sleep phase applies stronger internal constraints (priors) that are unavailable during rapid online perception. This active filtering removes the input-specific noise that high-capacity networks otherwise memorise, allowing the system to leverage its full capacity for semantic discrimination. c, Mechanism of Compression ($d=512$). Latent norm distributions reveal the physical basis of this effect: strong homeostatic pressure during sleep contracts memory traces toward the generative manifold (Purple), explicitly reducing representational cost $I(X;Z)$ compared to the high-entropy wake state (Orange).
  • ...and 1 more figures
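
The captions above describe consolidation as a refiner applied repeatedly to stored latents ($Z_0 \to Z_N$) without re-accessing the input. The toy sketch below illustrates only that general idea and is not the paper's model: the functions `refine` and `consolidate`, the fixed two-dimensional `prototypes`, and the step size `rate` are all hypothetical, and a simple contraction toward the nearest prototype stands in for the learned consolidator $\Psi$.

```python
import numpy as np

def refine(z, prototypes, rate=0.3):
    """One hypothetical consolidation step (assumption, not the paper's Psi):
    pull each stored latent part of the way toward its nearest prototype."""
    # Pairwise distances between latents (n, d) and prototypes (k, d): shape (n, k).
    dists = np.linalg.norm(z[:, None, :] - prototypes[None, :, :], axis=2)
    nearest = prototypes[dists.argmin(axis=1)]
    return z + rate * (nearest - z)

def consolidate(z0, prototypes, n_steps=5, rate=0.3):
    """Apply the refiner n_steps times using only the stored latents
    (no sensory input), returning the trajectory [Z_0, ..., Z_N]."""
    trajectory = [z0]
    for _ in range(n_steps):
        trajectory.append(refine(trajectory[-1], prototypes, rate))
    return trajectory

rng = np.random.default_rng(0)
prototypes = np.array([[5.0, 0.0], [-5.0, 0.0]])  # two toy categories
labels = rng.integers(0, 2, size=200)
# Initial "high-fidelity" encoding: class signal plus input-specific noise.
z0 = prototypes[labels] + rng.normal(scale=1.0, size=(200, 2))

traj = consolidate(z0, prototypes, n_steps=5)
# Residual input-specific variation: mean distance to the true class prototype.
residual = [float(np.linalg.norm(z - prototypes[labels], axis=1).mean()) for z in traj]
# Class information retained: accuracy of nearest-prototype decoding at the last step.
final_pred = np.linalg.norm(traj[-1][:, None, :] - prototypes[None, :, :], axis=2).argmin(axis=1)
accuracy = float((final_pred == labels).mean())
```

In this toy setting `residual` shrinks with every step while nearest-prototype decoding still recovers the labels, loosely mirroring the pattern of falling $I(X;Z_t)$ with preserved $I(Y;Z_t)$ described in Figure 2d. The mutual-information proxies used in the paper are not reproduced here.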