Revisiting Softmax Masking: Stop Gradient for Enhancing Stability in Replay-based Continual Learning
Hoyong Kim, Minchan Kwon, Kangil Kim
TL;DR
The paper addresses catastrophic forgetting in replay-based continual learning (CL) by examining the pull-push dynamics induced by cross-entropy loss with softmax. It revisits softmax masking and introduces a general masked softmax that replaces non-current-task logits with a mask value $m$ and stops gradient flow through the masked entries, giving explicit control over gradient flow and stability. The authors show that negative-infinity masking ($m=-\infty$) can boost stability but may conflict with dark knowledge, so they propose a flexible masking strategy that balances stability and plasticity. They also caution that combining masked softmax with distillation can be harmful and must be done with care. Across standard CL benchmarks, including low-buffer settings, the method improves final accuracy and reduces forgetting, and the tunable mask value offers a practical handle on the stability-plasticity trade-off even with extremely small episodic memories. Overall, the work provides a principled mechanism for controlling inter-task interference in replay-based CL.
Abstract
In replay-based methods for continual learning, replaying input samples stored in episodic memory has proven effective in alleviating catastrophic forgetting. However, the role of cross-entropy loss with softmax as a potential key factor in catastrophic forgetting has been underexplored. In this paper, we analyze the effect of softmax and revisit softmax masking with negative infinity to shed light on its ability to mitigate catastrophic forgetting. Based on these analyses, we find that negative-infinity masked softmax is not always compatible with dark knowledge. To improve the compatibility, we propose a general masked softmax that controls stability by adjusting the gradient scale for old and new classes. We demonstrate that applying our method to other replay-based methods yields better performance, primarily by enhancing model stability on continual learning benchmarks, even when the buffer size is set to an extremely small value.
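To make the masking idea in the summary above concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single sample, a boolean list `active` marking the classes whose logits are kept, and a constant mask value `m` written over all other logits. The function name `masked_softmax_ce` and its signature are hypothetical. Because the masked entries are constants, the gradient with respect to the original logits at those positions is zero, which is the stop-gradient effect on masked entries.

```python
import math


def masked_softmax_ce(logits, target, active, m=float("-inf")):
    """Cross-entropy over a masked softmax for one sample (illustrative sketch).

    logits: list of raw class scores
    target: index of the true class (assumed to be an active class)
    active: list of bools, True where the original logit is kept
    m:      constant that replaces every inactive logit (hypothetical parameter)
    Returns (loss, grad), where grad is taken w.r.t. the original logits.
    """
    if not active[target]:
        raise ValueError("the target class must not be masked")

    # Overwrite inactive logits with the constant m. A constant carries no
    # dependence on the network output, so no gradient reaches those entries.
    z = [x if keep else m for x, keep in zip(logits, active)]

    # Numerically stable softmax; math.exp(-inf) evaluates to 0.0.
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Cross-entropy written as log-sum-exp minus the target logit.
    loss = math.log(total) - (z[target] - z_max)

    # Softmax cross-entropy gradient (p - y) on active entries, zero on masked ones.
    grad = [
        (p - (1.0 if i == target else 0.0)) if keep else 0.0
        for i, (p, keep) in enumerate(zip(probs, active))
    ]
    return loss, grad


# Toy example: classes 0-1 belong to an old task, classes 2-3 to the current one.
logits = [2.0, 1.0, 0.5, -1.0]
active = [False, False, True, True]

loss_inf, grad_inf = masked_softmax_ce(logits, 2, active)          # m = -inf
loss_fin, grad_fin = masked_softmax_ce(logits, 2, active, m=0.0)   # finite m
```

With $m=-\infty$ the masked classes receive zero probability, so the softmax is effectively restricted to the active classes. With a finite `m` the masked classes still contribute to the normalizer, which changes the loss and the gradient scale on the active logits while the masked logits themselves still receive no gradient. How the paper chooses `m` for old and new classes is not specified in this summary, so the value `0.0` here is purely illustrative.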
