FSC-Net: Fast-Slow Consolidation Networks for Continual Learning

Mohamed El Gorrim

TL;DR

FSC-Net tackles catastrophic forgetting in continual learning by separating rapid task learning from slow knowledge consolidation in a dual-network design. The fast network (NN1) adapts quickly to new tasks, while the slow network (NN2) consolidates knowledge through replay and periodic distillation; pure replay without distillation during consolidation was found to be most effective. On Split-MNIST, NN2 retains substantially more than NN1 (91.71% ± 0.62% vs 87.43% ± 1.27%). On Split-CIFAR-10, NN2 gains +8.20pp over NN1, although absolute performance stays modest with the simple MLP backbone. The findings indicate that consolidation efficacy stems from the training protocol and replay-based rehearsal rather than architectural complexity, and they point to practical considerations for applying dual-timescale consolidation to other backbones and task sequences.

Abstract

Continual learning remains challenging due to catastrophic forgetting, where neural networks lose previously acquired knowledge when learning new tasks. Inspired by memory consolidation in neuroscience, we propose FSC-Net (Fast-Slow Consolidation Networks), a dual-network architecture that separates rapid task learning from gradual knowledge consolidation. Our method employs a fast network (NN1) for immediate adaptation to new tasks and a slow network (NN2) that consolidates knowledge through distillation and replay. Within the family of MLP-based NN1 variants we evaluated, consolidation effectiveness is driven more by methodology than by architectural embellishments: a simple MLP outperforms more complex similarity-gated variants by 1.2pp. Through systematic hyperparameter analysis, we observed empirically that pure replay without distillation during consolidation achieves superior performance, consistent with the hypothesis that distillation from the fast network introduces recency bias. On Split-MNIST (30 seeds), FSC-Net achieves 91.71% ± 0.62% retention accuracy, a +4.27pp gain over the fast network alone (87.43% ± 1.27%, paired t=23.585, p < 1e-10). On Split-CIFAR-10 (5 seeds), our method achieves 33.31% ± 0.38% retention, a +8.20pp gain over the fast network alone (25.11% ± 1.61%, paired t=9.75, p < 1e-3), though absolute performance remains modest and below random expectation, highlighting the need for stronger backbones. Our results provide empirical evidence that the dual-timescale consolidation mechanism, rather than architectural complexity, is central to mitigating catastrophic forgetting in this setting.
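The paired t statistics quoted in the abstract compare NN2 and NN1 seed by seed. As a rough illustration of how such a statistic is typically computed, here is a minimal sketch using only the Python standard library. The function name `paired_t_statistic` and the per-seed numbers are hypothetical and are not the paper's data or code.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (mean difference, t statistic, degrees of freedom) for paired samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    # Pair the runs seed by seed and work on the per-seed differences.
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd / math.sqrt(n))
    return mean_diff, t, n - 1


# Hypothetical per-seed retention accuracies in percent (illustrative only).
nn2 = [91.0, 92.5, 91.5, 92.0, 91.0]
nn1 = [87.0, 88.0, 88.0, 87.5, 86.0]

mean_diff, t_stat, dof = paired_t_statistic(nn2, nn1)
print(f"mean gain = {mean_diff:.2f}pp, t = {t_stat:.3f}, dof = {dof}")
```

Because the test operates on per-seed differences, a consistent gain across seeds can yield a large t value even when the two networks' individual standard deviations overlap.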

Paper Structure

This paper contains 40 sections, 6 equations, 3 figures, 8 tables, 2 algorithms.

Figures (3)

  • Figure 1: FSC-Net architecture overview. The system employs dual networks: NN1 (fast, red) rapidly adapts to new tasks with a high learning rate ($10^{-3}$), while NN2 (slow, teal) consolidates knowledge with a lower learning rate ($5\times10^{-4}$). Input $x$ feeds both networks, with NN1 providing a summary embedding $s$ to NN2. The replay buffer stores samples from all tasks for offline consolidation. During task training, knowledge distillation ($\lambda=0.3$) helps NN2 track NN1's adaptation. During offline consolidation, pure replay ($\lambda=0$) provides superior performance by avoiding recency bias from NN1's task-specific predictions.
  • Figure 2: Split-MNIST results across 30 seeds ($\lambda=0.0$ investigation). Top-left: Distribution of final retention shows NN2 consistently outperforms NN1. Top-right: Retention degrades across tasks but NN2 maintains higher accuracy. Bottom-left: Box plot shows NN2's tighter distribution (0.62% std vs 1.27%). Bottom-right: Summary statistics validate ablation findings. Raw data: results/simple_mlp/csv/split_mnist_30seeds_final_20251111_143325.csv.
  • Figure 3: Hyperparameter sensitivity analysis on Split-MNIST (seed 42). Performance is robust to initialization, as the 30-seed validation confirms, but robustness to the hyperparameter choices themselves was not systematically evaluated because tuning used a single seed.
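
The Figure 1 caption describes two regimes for NN2: distillation from NN1 with $\lambda=0.3$ during task training, and pure replay with $\lambda=0$ during offline consolidation. The sketch below illustrates one way such a switch could look. It assumes an additive objective (cross-entropy plus $\lambda$ times a KL term toward NN1's predictions), which the text here does not specify; every function name and logit value is hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Supervised loss of the slow network (NN2) on one labeled sample."""
    return -math.log(softmax(student_logits)[label])


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) between the two softmax distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def nn2_loss(student_logits, label, teacher_logits, lam):
    """Assumed objective: CE + lam * KL. With lam = 0 only the replay CE remains."""
    ce = cross_entropy(student_logits, label)
    if lam == 0.0:
        return ce  # pure replay: NN1's predictions are ignored entirely
    return ce + lam * kl_divergence(teacher_logits, student_logits)


LAMBDA_TASK = 0.3           # distillation weight during task training (Figure 1)
LAMBDA_CONSOLIDATION = 0.0  # pure replay during offline consolidation (Figure 1)

# Hypothetical logits for one replayed sample whose true class is 0.
nn2_logits = [2.0, 0.5, -1.0]
nn1_logits = [0.1, 3.0, -0.5]  # NN1 leans toward class 1, e.g. a more recent task
label = 0

task_loss = nn2_loss(nn2_logits, label, nn1_logits, LAMBDA_TASK)
consolidation_loss = nn2_loss(nn2_logits, label, nn1_logits, LAMBDA_CONSOLIDATION)
print(f"task-training loss = {task_loss:.4f}")
print(f"consolidation loss = {consolidation_loss:.4f}")
```

In this toy case NN1 favors a different class than the replayed label, so the distillation term pulls NN2 toward NN1's prediction. Setting $\lambda=0$ removes that pull, which is consistent with the recency-bias explanation given in the caption.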