Hard ASH: Sparsity and the right optimizer make a continual learner
Santtu Keskinen
TL;DR
The paper addresses catastrophic forgetting in class-incremental learning. It shows that sparse representations, implemented with the Adaptive SwisH (ASH) activation and a new variant, Hard ASH, combined with an adaptive learning-rate optimizer (notably Adagrad), can approach the performance of established regularization methods on Split-MNIST without task-boundary information. In the experiments, sparsity yields strong retention while maintaining plasticity, especially when paired with Adagrad; additional analyses examine gradient sparsity, and plasticity is evaluated on permuted MNIST. The results suggest a simpler, computationally efficient route to continual learning that relies on optimizer choice and gradient sparsity rather than on complex regularization schemes.
Abstract
In class-incremental learning, neural networks typically suffer from catastrophic forgetting. We show that an MLP with a sparse activation function and an adaptive learning-rate optimizer can compete with established regularization techniques on the Split-MNIST task. We highlight the effectiveness of the Adaptive SwisH (ASH) activation function in this context and introduce a novel variant, Hard Adaptive SwisH (Hard ASH), to further enhance learning retention.
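To give some intuition for why an adaptive learning-rate optimizer and sparse gradients might interact well, the sketch below implements the standard Adagrad update on two toy weights. It is an illustration only, not the paper's training code: the function name `adagrad_step`, the learning rate, and the gradient values are all assumptions chosen for the example. One weight receives a gradient on every step, while the other receives a gradient only once, as might happen when a sparse activation leaves its unit inactive.

```python
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One standard Adagrad update, applied in place (hypothetical helper)."""
    for i, g in enumerate(grads):
        accum[i] += g * g  # per-parameter running sum of squared gradients
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return params, accum

# Toy setting (assumed values): weight 0 gets a gradient on every step,
# weight 1 only on the first step, as if its unit were inactive afterwards.
params = [0.0, 0.0]
accum = [0.0, 0.0]
history = []  # per-step absolute movement of each weight

for step in range(5):
    grads = [1.0, 1.0 if step == 0 else 0.0]
    before = list(params)
    adagrad_step(params, grads, accum)
    history.append([abs(a - b) for a, b in zip(params, before)])

for step, (dense_move, sparse_move) in enumerate(history):
    print(f"step {step}: dense moved {dense_move:.4f}, sparse moved {sparse_move:.4f}")
```

In this toy run, the frequently updated weight takes progressively smaller steps as its accumulator grows, and the weight with zero gradient does not move at all. Whether this is the mechanism behind the retention reported in the paper is not something this sketch establishes; it only shows the behavior of the update rule itself.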
