DSD$^2$: Can We Dodge Sparse Double Descent and Compress the Neural Network Worry-Free?

Victor Quétu, Enzo Tartaglione

TL;DR

The paper addresses sparse double descent (SDD), a non-monotonic test-performance curve observed when pruning large neural networks, and shows that traditional early stopping can fail in highly over-parameterized regimes. It introduces a knowledge distillation (KD) framework in which a student learns from a sparse (or dense) teacher taken from its best-fit region, transferring the teacher's regularization properties so that the student dodges SDD. An entropy-based diagnostic monitors learning dynamics, linking regime transitions to changes in activation entropy and enabling practical early stopping. Empirical results on CIFAR-10/100 and related setups show monotonic generalization and substantial computational savings when distilling from sparse teachers, which makes the approach viable for training robust, compact models in noisy settings.
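To make the distillation idea concrete, here is a minimal sketch of a per-sample teacher-student objective. It assumes the standard temperature-scaled formulation of knowledge distillation; this summary does not state the paper's exact loss, so the function names `log_softmax` and `kd_loss` and the `temperature` and `alpha` defaults are illustrative assumptions, not the authors' settings.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of a list of logits at a given temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_total for s in scaled]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Per-sample distillation loss (assumed standard form, not taken from the paper).

    Combines cross-entropy on the hard label with the KL divergence between
    the temperature-softened teacher and student distributions.
    """
    # Hard-label cross-entropy of the student at temperature 1.
    ce = -log_softmax(student_logits)[label]

    # KL(teacher || student) on temperature-softened distributions.
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))

    # The T^2 factor keeps the soft-target gradient scale comparable across temperatures.
    return (1.0 - alpha) * ce + alpha * temperature ** 2 * kl
```

In this sketch, `teacher_logits` would come from the pruned (or dense) teacher selected in its best-fit region. Setting `alpha=0` recovers plain supervised training on the possibly noisy labels, while `alpha=1` trains the student on the teacher's soft targets only.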

Abstract

Recent works have shown that modern deep learning models can exhibit a sparse double descent phenomenon. Indeed, as the sparsity of the model increases, the test performance first worsens since the model is overfitting the training data; then, the overfitting reduces, leading to an improvement in performance, and finally, the model begins to forget critical information, resulting in underfitting. Such behavior prevents the use of traditional early-stopping criteria. In this work, we make three key contributions. First, we propose a learning framework that avoids such a phenomenon and improves generalization. Second, we introduce an entropy measure providing more insights into the emergence of this phenomenon and enabling the use of traditional stop criteria. Third, we provide a comprehensive quantitative analysis of contingent factors such as re-initialization methods, model width and depth, and dataset noise. The contributions are supported by empirical evidence in typical setups. Our code is available at https://github.com/VGCQ/DSD2.

Paper Structure (45 sections, 2 equations, 14 figures, 14 tables, 2 algorithms)

Figures (14)

  • Figure 1: Distilling knowledge from a sparse teacher grants access to solutions (for the student model) where SDD is dodged, also saving computation.
  • Figure 2: Performance of ResNet-18 with different amounts of noise $\varepsilon$ on CIFAR-10 (a) and CIFAR-100 (b). I: Light Phase. II: Critical Phase. III: Sweet Phase. IV: Collapsed Phase.
  • Figure 3: Performance of ResNet-18 on CIFAR-100 with $\varepsilon=10\%$ when retrained from either the original initialization (lottery ticket), a random re-initialization, or from the last configuration achieved before pruning.
  • Figure 4: Performance of VGG-like model, vanilla-trained (a, b), distilled from a sparse teacher (c, d), varying the depth $\delta$ (a, c) and the width $\gamma$ (b, d) on CIFAR-10 with $\varepsilon=50\%$.
  • Figure 5: Performance of the VGG-like model on CIFAR-10 (a, b, c) and CIFAR-100 (d, e, f) for different label noises. Left: $\varepsilon=10\%$. Middle: $\varepsilon=20\%$. Right: $\varepsilon=50\%$.
  • ...and 9 more figures