Activation Function Design Sustains Plasticity in Continual Learning

Lute Lillo, Nick Cheney

TL;DR

This work tackles loss of plasticity in continual learning by highlighting activation function geometry as a fundamental, domain-general lever. It introduces Smooth-Leaky and Randomized Smooth-Leaky, drop-in nonlinearities that preserve a non-zero derivative floor and exhibit a $C^1$ transition, and demonstrates their effectiveness in both continual supervised benchmarks and non-stationary reinforcement learning. Through a property-level analysis of negative-slope behavior and saturation, plus a desaturation stress protocol, the authors show that careful activation design sustains adaptation to shifting distributions without increasing capacity. The study also provides robust diagnostic metrics, including a Plasticity Score and Generalization Gap, to quantify trainability versus transfer. Collectively, the results argue that activation-function design offers a lightweight, domain-general path to mitigate plasticity loss across tasks and environments, informing future hardware and algorithmic choices for continual learning.

Abstract

In independent, identically distributed (i.i.d.) training regimes, activation functions have been benchmarked extensively, and their differences often shrink once model size and optimization are tuned. In continual learning, however, the picture is different: beyond catastrophic forgetting, models can progressively lose the ability to adapt (referred to as loss of plasticity), and the role of the non-linearity in this failure mode remains underexplored. We show that activation choice is a primary, architecture-agnostic lever for mitigating plasticity loss. Building on a property-level analysis of negative-branch shape and saturation behavior, we introduce two drop-in nonlinearities (Smooth-Leaky and Randomized Smooth-Leaky) and evaluate them in two complementary settings: (i) supervised class-incremental benchmarks and (ii) reinforcement learning with non-stationary MuJoCo environments designed to induce controlled distribution and dynamics shifts. We also provide a simple stress protocol and diagnostics that link the shape of the activation to adaptation under change. The takeaway is straightforward: thoughtful activation design offers a lightweight, domain-general way to sustain plasticity in continual learning without extra capacity or task-specific tuning.

Paper Structure

This paper contains 53 sections, 16 equations, 14 figures, 24 tables.

Figures (14)

  • Figure 1: A: Final accuracy vs. effective negative slope $\bar{s}$. B: Dead-unit fraction vs. $\bar{s}$. Linear-leak families peak for $\bar{s}\!\in\![0.6,0.9]$. Smooth-tailed activations are plotted on the same $\bar{s}$ axis; they underperform within the 'Goldilocks zone' and only approach the linear-leak peak when $\bar{s}\!>\!1$, reflecting concentrated near-zero responsiveness and vanishing tails. C: Effective rank of the gradient Gram matrix. D: Dominant $\lambda_{\max}$. Smooth-tailed activations show spikes at large $\bar{s}$, while constant-slope leaks remain comparatively stable.
  • Figure 2: Desaturation under scaling shocks $\gamma$. Left: mean AUSC (lower is better). Middle: SF recovery time (epochs to halve the saturated fraction after the shock; successful recoveries only). Right: SF non-recovery rate (%). Groups: Zero-floor = ReLU, Tanh, Sigmoid; Non-zero-floor = Leaky-ReLU, RReLU, PReLU; Effective non-zero-floor = ELU, CELU, SELU, GELU, Swish. See App. \ref{sec:h2_1_der_floor} for details.
  • Figure 3: Sidedness effects under shocks. Left: Peak saturated fraction during the shock (higher = more units saturated). Middle: Saturation Fraction (SF) time-to-half-recover (epochs; successful recoveries only; lower is better). Right: AUSC (lower is better). Groups: One-sided (kink) = Leaky-ReLU, PReLU, RReLU; One-sided (smooth) = ELU, CELU, SELU; Two-sided (saturating) = Sigmoid, Tanh. See App. \ref{sec:h2_2_two_side_penalty} for details.
  • Figure 4: Correlation of Dead-Band Width Score with Saturation Recovery Metrics (All Gammas Aggregated). Left: Average Area Under Saturation Curve (Avg. AUSC) vs. Dead-Band Width Score. A strong positive correlation (Pearson $r=0.81,\;p=0.0016$) is observed. Middle: Average Saturation Fraction (SF) Recovery Time (for successful recoveries, measured in epochs) vs. Dead-Band Width Score. No significant correlation is found (Pearson $r=-0.25,\;p=0.45$). Right: Average SF Non-Recovery Rate (%) vs. Dead-Band Width Score. A strong positive correlation (Pearson $r=0.84,\;p=0.0013$) is observed, indicating functions more prone to saturation are more likely to fail SF recovery.
  • Figure 5: Smooth-Leaky with $\alpha{=}0.1$, $p{=}3.0$, $c{=}5.0$. Randomized Smooth-Leaky draws $\alpha$ from bounds.
  • ...and 9 more figures
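
The desaturation metrics named in the captions of Figures 2 to 4 (AUSC, SF recovery time, SF non-recovery rate) can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the captions do not confirm: the saturated fraction is logged once per epoch, AUSC is a trapezoidal area over the post-shock epochs with unit spacing, and "halving" is measured relative to the saturated fraction at the shock epoch. The function names `ausc`, `time_to_half_recover`, and `non_recovery_rate` are hypothetical.

```python
from typing import List, Optional, Sequence


def ausc(sat_frac: Sequence[float], shock_epoch: int) -> float:
    """Trapezoidal area under the saturated-fraction curve from the shock onward.

    Assumption: one measurement per epoch, so the spacing between points is 1.
    """
    post = list(sat_frac[shock_epoch:])
    return sum((a + b) / 2.0 for a, b in zip(post, post[1:]))


def time_to_half_recover(sat_frac: Sequence[float], shock_epoch: int) -> Optional[int]:
    """Epochs after the shock until the saturated fraction is at most half
    of its value at the shock epoch. Returns None if that never happens.

    Assumption: the reference level is the value at the shock epoch itself.
    """
    post = list(sat_frac[shock_epoch:])
    if not post:
        return None
    target = post[0] / 2.0
    for k, value in enumerate(post):
        if value <= target:
            return k
    return None


def non_recovery_rate(recovery_times: List[Optional[int]]) -> float:
    """Percentage of runs that never halved their saturated fraction."""
    if not recovery_times:
        return 0.0
    failures = sum(1 for t in recovery_times if t is None)
    return 100.0 * failures / len(recovery_times)


# Toy curves (made-up numbers, for illustration only).
recovering = [0.05, 0.05, 0.80, 0.60, 0.35, 0.30, 0.10]  # shock at epoch 2
stuck = [0.10, 0.90, 0.90, 0.80]                          # shock at epoch 1

times = [time_to_half_recover(recovering, 2), time_to_half_recover(stuck, 1)]
print(ausc(recovering, 2), times, non_recovery_rate(times))
```

Under these assumptions a run that never recovers yields `None`, so it is excluded from the recovery-time average (matching "successful recoveries only") and counted only in the non-recovery rate.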