Unraveling the Enigma of Double Descent: An In-depth Analysis through the Lens of Learned Feature Space

Yufei Gu, Xiaoqing Zheng, Tomaso Aste

TL;DR

Problem: why does double descent occur in deep networks trained with noisy labels? Approach: replicate the phenomenon across FCNN, CNN, and ResNet18, and interpret the learned feature space by applying kNN to quantify how noisy samples cluster with clean counterparts. Findings: double descent peaks intensify with label noise; over-parameterized models tend to isolate noise in the learned representations, and the kNN prediction accuracy P on noisy-labelled training data tracks generalization across architectures and datasets. Significance: offers a concrete mechanism for double descent via implicit regularization, suggesting new analytic directions and enabling reproducibility through public code.

Abstract

Double descent is a counter-intuitive phenomenon in machine learning that researchers have observed in various models and tasks. While some theoretical explanations have been proposed for specific contexts, an accepted theory accounting for its occurrence in deep learning has yet to be established. In this study, we revisit double descent and demonstrate that its occurrence is strongly influenced by the presence of noisy data. Through a comprehensive analysis of the feature space of learned representations, we show that double descent arises in imperfect models trained with noisy data. We argue that double descent is a consequence of the model first learning the noisy data until interpolation and then, through the implicit regularization added by over-parameterization, acquiring the capability to separate the information from the noise.
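
The kNN analysis of the learned feature space can be illustrated with a minimal sketch. It assumes that features are extracted from a trained network's hidden layer, that neighbours are found by Euclidean distance, and that the prediction is a majority vote over the (noisy) training labels of the k nearest neighbours. The function name, the distance metric, the value of k, and the voting rule are illustrative assumptions, not details confirmed by the text above.

```python
import numpy as np

def knn_clean_label_accuracy(features, noisy_labels, clean_labels, k=5):
    """Fraction of mislabelled training points whose k nearest neighbours
    in feature space vote for the original clean label.

    Assumptions (hypothetical, for illustration only): Euclidean distance,
    majority vote over the noisy training labels, integer class labels.
    """
    features = np.asarray(features, dtype=float)
    noisy_labels = np.asarray(noisy_labels, dtype=int)
    clean_labels = np.asarray(clean_labels, dtype=int)

    # Only samples whose label was corrupted are queried.
    noisy_idx = np.flatnonzero(noisy_labels != clean_labels)
    if noisy_idx.size == 0:
        raise ValueError("no mislabelled samples to evaluate")

    # Squared Euclidean distance from every noisy sample to every sample.
    # This is O(M * N * D) in memory, so it only suits small data sets.
    queries = features[noisy_idx]
    d2 = ((queries[:, None, :] - features[None, :, :]) ** 2).sum(axis=-1)

    # A sample must not count as its own neighbour.
    d2[np.arange(noisy_idx.size), noisy_idx] = np.inf

    # Labels of the k nearest neighbours, then a per-row majority vote.
    neighbours = np.argsort(d2, axis=1)[:, :k]
    votes = noisy_labels[neighbours]
    n_classes = int(max(noisy_labels.max(), clean_labels.max())) + 1
    counts = np.apply_along_axis(np.bincount, 1, votes, minlength=n_classes)
    predicted = counts.argmax(axis=1)

    return float((predicted == clean_labels[noisy_idx]).mean())
```

A value close to 1 would indicate that mislabelled samples sit among correctly labelled samples of their true class in feature space, which is the sense in which noise is separated from information.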

Paper Structure (11 sections, 1 equation, 6 figures)

Figures (6)

  • Figure 1: Double descent on two-layer FCNNs trained on MNIST $(N=4000)$ under explicit label noise ratios of $p = [0\%, 10\%, 20\%]$, together with the prediction accuracy on noisy-labelled data, denoted $P$, when $p > 0$. Models are optimized with SGD for 4000 epochs with a decreasing learning rate. The test error curves for $p = [10\%, 20\%]$ exhibit double descent, and the prediction accuracy $P$ from in-context learning of noisy data correlates with generalization performance beyond context.
  • Figure 2: Double descent on five-layer CNNs trained on CIFAR-10 $(N=50000)$ under label noise ratios of $p = [0\%, 10\%, 20\%]$, together with the prediction accuracy on noisy-labelled data, denoted $P$, when $p > 0$. Models are optimized with SGD for 200 epochs with a decreasing learning rate. The test error curves for $p = [0\%, 10\%, 20\%]$ exhibit double descent, and the prediction accuracy $P$ from in-context learning of noisy data correlates with generalization performance beyond context.
  • Figure 3: Double descent on ResNet18s trained on CIFAR-10 $(N=50000)$ under label noise ratios of $p = [0\%, 10\%, 20\%]$, together with the prediction accuracy $P$ when $p > 0$. Models are optimized with SGD for 200 epochs with a decreasing learning rate. The test error curves for $p = [0\%, 10\%, 20\%]$ exhibit double descent. The k-NN prediction accuracy $P$ on the clean labels first increases and then decreases to 0 at the interpolation threshold. An additional purple line for predicting the noisy labels shows that the k-NN predicts all noisy data with their clean labels.
  • Figure 4: Illustration of how the double descent effect emerges from the perspective of the present paper. Starting from the bottom of the left panel: 1. the perfect learner, in the absence of noise, will learn the perfect model with a number of parameters approximately equal to the number of significant examples; 2. in the presence of noise, the perfect learner must also learn to distinguish information from noise, and it will learn a perfect model with a number of parameters higher than the number of noisy examples; 3. the imperfect learner with optimal regularization will learn a well-performing model that distinguishes information from noise using a number of parameters higher than the number of examples, but it should not show a double descent peak. On the right panel: 4. the imperfect learner with sub-optimal regularization will learn a well-performing model by first interpolating noise and signal and then learning to distinguish information from noise, using a number of parameters higher than the number of examples and showing the double descent peak; 5. the larger the noise, the higher the double descent peak and the larger the number of parameters needed to distinguish information from noise.
  • Figure 5: Scaling of the number of parameters (model size) with the layer-width unit $k$ for the three neural architectures used in our experiments in the Methodology section. We apply a logarithmic scale to the parameter counts $P$ of all neural architectures.
  • ...and 1 more figures
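
The explicit label noise ratio $p$ used in the figures above can be illustrated with a small sketch. It assumes symmetric noise in which a fraction $p$ of the training labels is reassigned uniformly at random to a different class; whether the corrupted label is forced to differ from the original is not stated in the captions, so that choice, along with the function name and the seed parameter, is a hypothetical one made for illustration.

```python
import numpy as np

def inject_label_noise(labels, p, num_classes, seed=0):
    """Return (noisy_labels, noisy_indices) with a fraction p of labels flipped.

    Assumption (not confirmed by the source): each selected label is replaced
    by a uniformly chosen *different* class, so exactly round(p * N) labels
    end up wrong.
    """
    labels = np.asarray(labels, dtype=int)
    rng = np.random.default_rng(seed)

    # Pick which samples to corrupt, without repetition.
    n_noisy = int(round(p * labels.size))
    noisy_indices = rng.choice(labels.size, size=n_noisy, replace=False)

    # Adding an offset in [1, num_classes - 1] modulo num_classes
    # guarantees the new label differs from the original one.
    offsets = rng.integers(1, num_classes, size=n_noisy)
    noisy_labels = labels.copy()
    noisy_labels[noisy_indices] = (labels[noisy_indices] + offsets) % num_classes

    return noisy_labels, noisy_indices

# Example with N = 4000 and p = 20%, matching the sample size in Figure 1.
clean = np.arange(4000) % 10
noisy, flipped = inject_label_noise(clean, p=0.20, num_classes=10)
```

Keeping the indices of the corrupted samples makes it possible to evaluate the prediction accuracy $P$ on exactly those samples later.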