DSD$^2$: Can We Dodge Sparse Double Descent and Compress the Neural Network Worry-Free?

Victor Quétu; Enzo Tartaglione

TL;DR

The paper addresses sparse double descent (SDD), observed when pruning large neural networks, and shows that traditional early stopping can fail in highly over-parameterized regimes. It introduces a knowledge distillation (KD) framework in which a student learns from a sparse (or dense) teacher taken from its best-fit region, transferring the teacher's regularization properties so that the student dodges SDD. An entropy-based diagnostic monitors learning dynamics, linking regime transitions to changes in activation entropy and enabling practical early stopping. Empirical results on CIFAR-10/100 and related setups show monotonic generalization and substantial computational savings when distilling from sparse teachers, pointing to a viable approach for training robust, compact models in noisy settings.
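To make the distillation idea concrete, here is a minimal sketch of a standard temperature-scaled KD objective in plain Python. It is an illustration only: this summary does not state the paper's exact loss, so the function names, the `temperature` and `alpha` parameters, and the weighting scheme are assumptions borrowed from the common Hinton-style formulation.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-likelihood of the (possibly noisy) hard label."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Hypothetical KD objective (assumed form, not taken from the paper):
    a weighted sum of the hard-label loss and the teacher-matching term."""
    hard = cross_entropy(softmax(student_logits), label)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    # The temperature**2 factor keeps the soft-term gradients on a
    # comparable scale when the temperature changes.
    return (1 - alpha) * hard + alpha * temperature ** 2 * soft

# Usage: the teacher would be the (sparse) model picked in its best-fit region.
loss = distillation_loss([1.5, 0.2, -0.3], [2.0, 1.0, 0.1], label=0)
```

With `alpha` close to 1 the student relies mostly on the teacher's soft predictions rather than on the training labels, which is one plausible way a teacher's regularization could carry over when labels are noisy.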

Abstract

Recent works have shown that modern deep learning models can exhibit a sparse double descent phenomenon. As the sparsity of the model increases, the test performance first worsens because the model overfits the training data; then the overfitting reduces, leading to an improvement in performance; finally, the model begins to forget critical information, resulting in underfitting. Such behavior prevents the use of traditional early stopping criteria. In this work, we make three key contributions. First, we propose a learning framework that avoids this phenomenon and improves generalization. Second, we introduce an entropy measure that provides more insight into the emergence of this phenomenon and enables the use of traditional stopping criteria. Third, we provide a comprehensive quantitative analysis of contingent factors such as re-initialization methods, model width and depth, and dataset noise. The contributions are supported by empirical evidence in typical setups. Our code is available at https://github.com/VGCQ/DSD2.
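The claim that sparse double descent defeats traditional stopping criteria can be illustrated with a toy example. The sketch below applies a patience-based rule to a synthetic test-error curve over increasing sparsity levels. The numbers in `toy_errors` are invented purely to mimic the shape described above (worsen, improve, collapse) and are not results from the paper; `early_stop_index` is a hypothetical helper, not part of the authors' code.

```python
def early_stop_index(errors, patience=1):
    """Return the index a patience-based criterion would select:
    stop after `patience` consecutive steps without improvement."""
    best, best_idx, waited = errors[0], 0, 0
    for i, err in enumerate(errors[1:], start=1):
        if err < best:
            best, best_idx, waited = err, i, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx

# Synthetic test error per pruning step (illustrative values only):
# rises first (overfitting), then falls, then rises again (underfitting).
toy_errors = [0.30, 0.33, 0.36, 0.31, 0.27, 0.25, 0.29, 0.40]

stopped_at = early_stop_index(toy_errors, patience=2)
true_best = min(range(len(toy_errors)), key=lambda i: toy_errors[i])
```

On this curve the criterion halts during the initial rise and returns the first, unpruned configuration, while the lowest error sits several pruning steps later. A monotonic curve would not have this problem, since the first sustained rise would then mark the true optimum.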

Paper Structure (45 sections, 2 equations, 14 figures, 14 tables, 2 algorithms)

Introduction
Related works
The real world is noisy
Double Descent in classification tasks.
Model size and sparse double descent
Background on neural network's pruning
Pruning exhibits sparse double descent
Setup
Experiments
Better low parametrization or extreme over parametrization?
Critical phase occurrence
An entropy-based interpretation to the sparse double descent
Generalization gap in deep double descent: relationships between DD and SDD
Results
Distilling knowledge to avoid the sparse double descent
...and 30 more sections

Figures (14)

Figure 1: Distilling knowledge from a sparse teacher grants access to solutions (for the student model) where SDD is dodged, also saving computation.
Figure 2: Performance of ResNet-18 with different amounts of noise $\varepsilon$ on CIFAR-10 (a) and CIFAR-100 (b). I: Light Phase. II: Critical Phase. III: Sweet Phase. IV: Collapsed Phase.
Figure 3: Performance of ResNet-18 on CIFAR-100 with $\varepsilon=10\%$ when retrained from either the original initialization (lottery ticket), a random re-initialization, or from the last configuration achieved before pruning.
Figure 4: Performance of the VGG-like model, vanilla-trained (a, b) or distilled from a sparse teacher (c, d), when varying the depth $\delta$ (a, c) and the width $\gamma$ (b, d) on CIFAR-10 with $\varepsilon=50\%$.
Figure 5: Performance of the VGG-like model on CIFAR-10 (a, b, c) and CIFAR-100 (d, e, f) for different label noises. Left: $\varepsilon=10\%$. Middle: $\varepsilon=20\%$. Right: $\varepsilon=50\%$.
...and 9 more figures
