Hard ASH: Sparsity and the right optimizer make a continual learner

Santtu Keskinen

TL;DR

The paper tackles catastrophic forgetting in class-incremental learning by showing that sparse representations, implemented via Adaptive SwisH (ASH) and a new variant, Hard ASH, combined with adaptive learning-rate optimizers (notably Adagrad), can approach the performance of established regularization methods on Split-MNIST without task-boundary information. It introduces Hard ASH and demonstrates experimentally that sparsity, especially when paired with Adagrad, yields strong retention while maintaining plasticity, with additional analyses of gradient sparsity and plasticity on permuted MNIST. The results suggest a simpler, computationally efficient route to continual learning that emphasizes optimizer choice and gradient sparsity over complex regularization schemes.

Abstract

In class-incremental learning, neural networks typically suffer from catastrophic forgetting. We show that an MLP featuring a sparse activation function and an adaptive learning-rate optimizer can compete with established regularization techniques on the Split-MNIST task. We highlight the effectiveness of the Adaptive SwisH (ASH) activation function in this context and introduce a novel variant, Hard Adaptive SwisH (Hard ASH), to further enhance learning retention.
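One intuition for why a sparse activation and an adaptive learning-rate optimizer might work well together can be sketched with the standard Adagrad update rule. The toy below is not the paper's code: the three-parameter "network", the gradients, and the helper name `adagrad_step` are all hypothetical, chosen only to show that parameters updated often on an earlier task take smaller steps later, while parameters that receive zero gradient are left untouched.

```python
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-10):
    """One standard Adagrad update on plain Python lists.

    accum holds the running sum of squared gradients per parameter.
    Returns the updated parameters and the updated accumulator.
    """
    new_accum = [a + g * g for a, g in zip(accum, grads)]
    new_params = [
        p - lr * g / (math.sqrt(a) + eps)
        for p, g, a in zip(params, grads, new_accum)
    ]
    return new_params, new_accum

# Hypothetical toy setup: three parameters, all starting at 1.0.
params = [1.0, 1.0, 1.0]
accum = [0.0, 0.0, 0.0]

# "Task A": a sparse gradient that only ever touches parameter 0.
for _ in range(100):
    params, accum = adagrad_step(params, [1.0, 0.0, 0.0], accum)

# "Task B": the gradient now touches parameters 0 and 1, but not 2.
new_params, _ = adagrad_step(params, [1.0, 1.0, 0.0], accum)
step = [abs(n - p) for n, p in zip(new_params, params)]

# Parameter 0 (heavily used by task A) moves far less than parameter 1
# (unused so far); parameter 2 receives no gradient and does not move.
print(step)
```

In this sketch the accumulated squared gradients act as a per-parameter brake: units that earlier data relied on are changed slowly, and units with zero gradient are not changed at all. Whether this fully accounts for the retention reported in the paper is not something this toy establishes.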

Paper Structure (17 sections, 3 figures, 6 tables)

Introduction
ASH and Hard ASH
Hard ASH
Experiment
Conclusions
URM Statement
Appendix
Reproducibility
Hard ASH formula
Network Initialization
Weight normalization
Full results
Hyperparameters
Performance without task splits
Adam and bias correction
...and 2 more sections

Figures (3)

Figure 1: Overall and per-task validation accuracies of a single run of each method. Vertical lines mark the points in training where the task changes. The optimizer is Adagrad unless otherwise specified. The best methods slowly lose accuracy on old tasks but struggle to learn the last task. ReLU forgets the old tasks even with a good optimizer like Adagrad, while Hard ASH retains some old-task performance even with plain SGD. Variations between runs are small enough to be barely visible.
Figure 2: Latest task and first task validation accuracies when varying $z_k$
Figure 3: Latest task and first task validation accuracies when varying $\alpha$
