DSD$^2$: Can We Dodge Sparse Double Descent and Compress the Neural Network Worry-Free?
Victor Quétu, Enzo Tartaglione
TL;DR
The paper addresses sparse double descent (SDD), observed when pruning large neural networks, and shows that traditional early stopping can fail in highly over-parameterized regimes. It introduces a knowledge distillation (KD) framework in which a student learns from a sparse (or dense) teacher taken from its best-fit region, so that the teacher's regularization properties transfer to the student and SDD is dodged. An entropy-based diagnostic monitors learning dynamics, links regime transitions to changes in activation entropy, and makes early stopping practical. Empirical results on CIFAR-10/100 and related setups show monotonic generalization and substantial computational savings when distilling from sparse teachers, pointing to a viable way of training robust, compact models in noisy settings.
Abstract
Recent works have shown that modern deep learning models can exhibit a sparse double descent phenomenon. Indeed, as the sparsity of the model increases, the test performance first worsens since the model is overfitting the training data; then, the overfitting reduces, leading to an improvement in performance; and finally, the model begins to forget critical information, resulting in underfitting. Such behavior prevents the use of traditional early stopping criteria. In this work, we make three key contributions. First, we propose a learning framework that avoids this phenomenon and improves generalization. Second, we introduce an entropy measure that provides more insight into the emergence of this phenomenon and enables the use of traditional stopping criteria. Third, we provide a comprehensive quantitative analysis of contingent factors such as re-initialization methods, model width and depth, and dataset noise. The contributions are supported by empirical evidence in typical setups. Our code is available at https://github.com/VGCQ/DSD2.
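To make the distillation idea in the TL;DR concrete, the following is a minimal sketch of a per-sample knowledge distillation loss in the standard Hinton-style form. It is an assumption that the framework uses this exact objective; the function names `softmax` and `kd_loss`, the temperature of 4.0, and the mixing weight `alpha` are illustrative choices and are not taken from the paper or its repository.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Hinton-style distillation loss for one sample (assumed form, not the paper's)."""
    # Hard-label term: cross-entropy of the student at temperature 1.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])

    # Soft-label term: KL(teacher || student) on temperature-softened outputs.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(q_teacher, q_student) if t > 0)

    # The T^2 factor keeps the soft-term gradients on the scale of the hard term.
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl


# Example: a student that disagrees with the teacher pays a larger penalty.
agree = kd_loss([2.0, 0.0], [2.0, 0.0], label=0)
disagree = kd_loss([2.0, 0.0], [0.0, 2.0], label=0)
```

In this sketch the teacher would be the sparse (or dense) model selected from its best-fit region, and its logits would be computed once per sample and held fixed while the student is trained and pruned.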
