On the Diminishing Returns of Width for Continual Learning

Etash Guha; Vihan Lakshman

TL;DR

This work addresses catastrophic forgetting in continual learning by establishing a finite-width theoretical framework that links network width $W$, depth $L$, sparsity $\alpha$, and task count to the continual-learning error $\epsilon_{t,t'}$. It shows that increasing width yields diminishing returns in reducing forgetting, with a bound that decays as $W^{-\beta}$ and scales with the number of tasks, while empirical results across rotated vision benchmarks and Wide ResNet architectures validate both the trend and the underlying lazy-training intuition. The analysis reveals that width acts as a functional regularizer, depth exacerbates forgetting, and row-wise sparsity can further mitigate forgetting without sacrificing accuracy, offering a nuanced view beyond simply widening networks. The results have practical implications for designing scalable continual learning systems, suggesting that combining width with targeted sparsity and regularization strategies may better manage forgetting than width alone.
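
The diminishing-returns claim can be made concrete with a short worked derivation. This is a sketch under a stated assumption: the width-dependent part of the bound is written as $B(W) = C\,W^{-\beta}$ with $\beta > 0$, where the constant $C$ is a hypothetical stand-in for every factor that does not depend on width (depth, sparsity, task count). The paper's exact bound may carry additional terms.

```latex
% Assumed form of the width-dependent bound (C and B are illustrative names):
B(W) = C\,W^{-\beta}, \qquad \beta > 0 .

% Marginal benefit of additional width:
\frac{dB}{dW} = -\beta\,C\,W^{-(\beta+1)} ,

% so the magnitude of the marginal gain decays faster than the bound itself.

% Absolute reduction obtained by doubling the width:
B(W) - B(2W) = C\,W^{-\beta}\left(1 - 2^{-\beta}\right) ,

% which shrinks as W^{-\beta}: each successive doubling buys a smaller
% absolute reduction than the previous one.
```

Under this form, every doubling removes the same fraction $1 - 2^{-\beta}$ of the remaining bound, so the absolute improvement per doubling keeps shrinking as the network widens.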

Abstract

While deep neural networks have demonstrated groundbreaking performance in various settings, these models often suffer from catastrophic forgetting when trained on new tasks in sequence. Several works have empirically demonstrated that increasing the width of a neural network leads to a decrease in catastrophic forgetting but have yet to characterize the exact relationship between width and continual learning. We design one of the first frameworks to analyze Continual Learning Theory and prove that width is directly related to forgetting in Feed-Forward Networks (FFN). Specifically, we demonstrate that increasing network widths to reduce forgetting yields diminishing returns. We empirically verify our claims at widths hitherto unexplored in prior studies where the diminishing returns are clearly observed as predicted by our theory.

Paper Structure (39 sections, 10 theorems, 41 equations, 14 figures, 5 tables)

Introduction
Contributions
Related Works
Continual Learning
Wide Networks
Preliminary
Notation
Problem Setup
Training Setup
Theoretical Analysis
Main Theorem
Proof Sketch
Number of Shared Active Rows
Distance between active rows after training
Width as a Functional Regularizer
...and 24 more sections

Key Result

Theorem 4.1

(Informal) Say we generate a series of models $\mathbf{M}_1, \dots, \mathbf{M}_T$ by training sequentially on datasets $\mathcal{D}_1, \dots, \mathcal{D}_T$ according to the Training Setup section. Let $\lambda_{i, j}^l$ be the ratio whose numerator is $\|\mathbf{A}_{l, j}[\mathcal{A}_{l, i}]\|_2$ and whose denominator is a norm of $\mathbf{A}_{l, i}$ indexed by $\mathcal{A}_{l, \cdot}$ … Here, $\chi$ denotes the maximum norm of the input in $\mathcal{D}_t$ …

Figures (14)

Figure 1: We plot the distance from initialization for both Rotated MNIST and Rotated Fashion MNIST experiments. We see that distance from initialization decreases slowly as the width is increased for both datasets. For the constants in the distance assumption, the best-fitting values are $\gamma = 0.013, \beta=0.311$ for Rotated MNIST and $\gamma = 2.5, \beta=0.12$ for Rotated Fashion MNIST. We plot the relationship predicted by that assumption with these parameters.
Figure 1: Our Continual Learning experiments on varying width FFNs on Rotated MNIST and Rotated Fashion MNIST. We see that the decrease in Average Forgetting levels off beyond a width of $2^{10}$.
Figure 2: We visualize the diminishing returns of increasing width across networks of varying depth. This corroborates our theoretical analysis.
Figure 2: We report the numbers from our continual learning experiments using a Wide ResNet model (Zagoruyko and Komodakis, 2016) on the SVHN and GTSRB datasets. The width reported in the first column corresponds to the multiplier applied to the width factor parameter of the Wide ResNet model. We again see a trend of diminishing returns similar to the one in the MLP setting.
Figure 3: We visualize the relationship between depth and task index on forgetting over several datasets. The error-over-time panel is only for Rotated MNIST.
...and 9 more figures
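
The fitted constants quoted in the Figure 1 caption can be turned into a quick numerical check of how slowly the distance from initialization shrinks with width. The sketch below assumes the fitted relationship has the power-law form `gamma * width ** (-beta)`. That form is an assumption based on the $W^{-\beta}$ decay mentioned in the TL;DR, and the function names are hypothetical, not taken from the paper's code.

```python
# Fitted (gamma, beta) pairs as quoted in the Figure 1 caption.
FITS = {
    "Rotated MNIST": (0.013, 0.311),
    "Rotated Fashion MNIST": (2.5, 0.12),
}


def predicted_distance(width, gamma, beta):
    """Assumed power-law form: gamma * width^(-beta)."""
    return gamma * width ** (-beta)


def doubling_ratio(beta):
    """Factor by which the predicted distance shrinks when the width doubles."""
    return 2.0 ** (-beta)


def width_factor_to_halve(beta):
    """Multiplicative increase in width needed to halve the predicted distance."""
    return 2.0 ** (1.0 / beta)


for name, (gamma, beta) in FITS.items():
    print(name)
    # Sweep widths 2^8 .. 2^14 to see the curve flatten.
    for exponent in range(8, 15, 2):
        width = 2 ** exponent
        print(f"  width 2^{exponent}: {predicted_distance(width, gamma, beta):.5f}")
    print(f"  shrink per doubling: {doubling_ratio(beta):.3f}")
    print(f"  width factor to halve: {width_factor_to_halve(beta):.1f}")
```

Under this assumed form, a small exponent means each doubling of width removes only a modest fraction of the remaining distance, which is consistent with the flattening curves the captions describe.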

Theorems & Definitions (17)

Theorem 4.1
Lemma 4.1
Lemma 4.1
Lemma 4.1
Definition 4.1
Definition 4.1
Theorem 4.2
Lemma 3.0
proof
Lemma 3.0
...and 7 more
