On the Diminishing Returns of Width for Continual Learning
Etash Guha, Vihan Lakshman
TL;DR
This work addresses catastrophic forgetting in continual learning by establishing a finite-width theoretical framework that links network width $W$, depth $L$, sparsity $\alpha$, and task count to the continual-learning error $\epsilon_{t,t'}$. It shows that increasing width yields diminishing returns in reducing forgetting, with a bound that decays as $W^{-\beta}$ and scales with the number of tasks, while empirical results across rotated vision benchmarks and Wide ResNet architectures validate both the trend and the underlying lazy-training intuition. The analysis reveals that width acts as a functional regularizer, depth exacerbates forgetting, and row-wise sparsity can further mitigate forgetting without sacrificing accuracy, offering a nuanced view beyond simply widening networks. The results have practical implications for designing scalable continual learning systems, suggesting that combining width with targeted sparsity and regularization strategies may better manage forgetting than width alone.
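To make the diminishing-returns claim concrete, suppose purely for illustration that the bound takes the form $\epsilon(W) \le C\,W^{-\beta}$ with $\beta > 0$, where the constant $C$ is a hypothetical stand-in for the dependence on task count, depth, and sparsity (the exact form and the value of $\beta$ are not given here). Under that assumption, the absolute reduction in the bound from doubling the width is
\[
C\,W^{-\beta} - C\,(2W)^{-\beta} = C\left(1 - 2^{-\beta}\right) W^{-\beta},
\]
which itself shrinks as $W$ grows, so each successive doubling buys a smaller improvement. With the illustrative value $\beta = 1/2$, doubling the width scales the bound by $2^{-1/2} \approx 0.71$, and halving the bound requires quadrupling the width, since $4^{-1/2} = 1/2$.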
Abstract
While deep neural networks have demonstrated groundbreaking performance in various settings, these models often suffer from \emph{catastrophic forgetting} when trained on new tasks in sequence. Several works have empirically demonstrated that increasing the width of a neural network reduces catastrophic forgetting, but they have yet to characterize the exact relationship between width and continual learning. We design one of the first frameworks for analyzing continual learning theoretically and prove that width is directly related to forgetting in feed-forward networks (FFNs). Specifically, we demonstrate that increasing network width to reduce forgetting yields diminishing returns. We empirically verify our claims at widths unexplored in prior studies, where the diminishing returns are clearly observed, as predicted by our theory.
