Learned Random Label Predictions as a Neural Network Complexity Metric

Marlon Becker, Benjamin Risse

TL;DR

This work probes whether memorizing randomly generated labels in parallel with real class labels reflects neural network complexity and generalization potential. By adding per-class random-label heads and three losses, the authors quantify memorization with a complexity proxy inspired by the Rademacher complexity $\mathfrak{R}_n(\mathcal{H})$ and introduce a regularizer that unlearns the random labels. They show that common regularizers reduce memorization, as measured by random-label accuracy, but observe no improvement in generalization on CIFAR-100, which challenges a straightforward link between memorization and generalization. The findings also reveal where in the network the transition from sample-specific to class-specific information occurs, and they raise the question of under which conditions reducing memorization yields practical performance gains.

Abstract

We empirically investigate the impact of learning randomly generated labels in parallel to class labels in supervised learning on memorization, model complexity, and generalization in deep neural networks. To this end, we introduce a multi-head network architecture as an extension of standard CNN architectures. Inspired by methods used in fair AI, our approach allows for the unlearning of random labels, preventing the network from memorizing individual samples. Based on the concept of Rademacher complexity, we first use our proposed method as a complexity metric to analyze the effects of common regularization techniques and challenge the traditional understanding of feature extraction and classification in CNNs. Second, we propose a novel regularizer that effectively reduces sample memorization. However, contrary to the predictions of classical statistical learning theory, we do not observe improvements in generalization.
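To make the Rademacher-complexity idea behind the metric concrete, the following is a minimal sketch, not the authors' implementation. It estimates the empirical Rademacher complexity of small, finite hypothesis classes by Monte Carlo: draw random $\pm1$ labels and measure how well the best hypothesis in the class correlates with them. The function name `empirical_rademacher`, the toy hypothesis classes, and the inputs `xs` are all assumptions introduced for illustration.

```python
import random


def empirical_rademacher(hypotheses, xs, num_draws=1000, seed=0):
    """Monte Carlo estimate of E_sigma[ max_h (1/n) * sum_i sigma_i * h(x_i) ].

    `hypotheses` is a finite list of functions mapping an input to -1 or +1.
    """
    rng = random.Random(seed)
    n = len(xs)
    # Precompute the predictions of every hypothesis on the fixed sample.
    preds = [[h(x) for x in xs] for h in hypotheses]
    total = 0.0
    for _ in range(num_draws):
        # One draw of random (Rademacher) labels for the whole sample.
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        # Best achievable correlation with the random labels within the class.
        total += max(sum(s * p for s, p in zip(sigma, row)) / n for row in preds)
    return total / num_draws


def make_table(bits):
    """Hypothetical lookup-table hypothesis: bit x of `bits` decides the label of x."""
    return lambda x: 1 if (bits >> x) & 1 else -1


xs = list(range(8))
constant_class = [lambda x: 1]                                   # cannot fit anything
threshold_class = [lambda x, t=t: 1 if x >= t else -1 for t in range(9)]
table_class = [make_table(bits) for bits in range(2 ** len(xs))]  # memorizes everything

r_constant = empirical_rademacher(constant_class, xs)
r_threshold = empirical_rademacher(threshold_class, xs)
r_table = empirical_rademacher(table_class, xs)
print(r_constant, r_threshold, r_table)
```

In this toy setting the lookup-table class fits every random labeling, so its estimate is exactly 1, while the single constant hypothesis stays near 0. A network that can memorize arbitrary random labels behaves like the lookup-table class, which is the intuition behind using random-label accuracy as a complexity proxy.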

Paper Structure

This paper contains 15 sections, 1 theorem, 7 equations, 6 figures.

Key Result

Theorem 1

Given a hypothesis class $\mathcal{H}$ and training data $\mathcal{S} = \{(x_1,\sigma_1),\dots,(x_n,\sigma_n)\}$ with $\sigma_1,\dots,\sigma_n \in \{\pm1\}$, for any $\delta > 0$, with probability at least $1-\delta$, it holds for any $h \in \mathcal{H}$ that
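The inequality itself is not reproduced in this summary. For orientation only, a standard textbook form of a Rademacher generalization bound for $\{\pm1\}$-valued hypotheses under the 0-1 loss is given below. The paper's exact statement and constants may differ. The symbols $R(h)$ (expected risk) and $\hat{R}_{\mathcal{S}}(h)$ (empirical risk on $\mathcal{S}$) are notation assumed here, not taken from the source.

$$
R(h) \;\le\; \hat{R}_{\mathcal{S}}(h) \;+\; \mathfrak{R}_n(\mathcal{H}) \;+\; \sqrt{\frac{\ln(1/\delta)}{2n}},
\qquad
\mathfrak{R}_n(\mathcal{H}) \;=\; \mathbb{E}_{x,\sigma}\!\left[\,\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, h(x_i)\right].
$$

Read this way, a hypothesis class that can correlate strongly with random labels has a large $\mathfrak{R}_n(\mathcal{H})$ and therefore a looser guarantee on the gap between training and test error.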

Figures (6)

  • Figure 1: A: Illustration of the multi-head architecture that is built on top of the feature extractor to predict random labels. B: While test and train accuracy converge quickly, the random label accuracy continues to increase after the train accuracy has reached nearly $100\%$, and eventually also approaches $100\%$.
  • Figure 2: The effect of common complexity regularizers can be measured with the proposed metric. A: Dropout. B: Weight decay. C: Label smoothing.
  • Figure 3: A: Effect of the number of copied layers, i.e., the copy depth $d$, on the random label prediction accuracy. VGG16 trained on CIFAR100 with SGD. The number of random labels is $n=2$, so $50\%$ accuracy corresponds to no memorization. B+C: While the regularization effectively reduces memorization, especially when no weight decay is used, there is no improvement in test accuracy.
  • Figure A.1: Dependence on the regularization factor $\lambda$ and the learning rate $\eta$. WideResNet-16-4 trained on CIFAR100 with $n=10$ random labels. A: No augmentation. B: Including flipping, cropping and cutout. If the learning rate is too low, increasing $\lambda$ may have a positive effect, which could be misinterpreted as a reduction in memorization but is actually caused by the regularizer implicitly fine-tuning the learning rate.
  • Figure A.2: WRN16-4 trained on CIFAR100 with number of random labels $n=100$. The proposed random prediction regularizer improves the test accuracy only when low label smoothing is chosen.
  • ...and 1 more figure
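The captions above use different numbers of random labels ($n=2$, $n=10$, $n=100$), so the accuracy that corresponds to "no memorization" differs between figures ($50\%$ for $n=2$). One way to compare such numbers is to rescale the random-label accuracy relative to its chance level $1/n$. The helper below is a hypothetical normalization introduced here for illustration and is not a quantity defined in the paper.

```python
def memorization_score(random_label_accuracy, num_random_labels):
    """Rescale random-label accuracy so that chance level maps to 0 and perfect fit to 1.

    Assumes the random labels are drawn uniformly from `num_random_labels` classes,
    so that guessing yields an accuracy of 1 / num_random_labels.
    """
    if num_random_labels < 2:
        raise ValueError("need at least two random label classes")
    chance = 1.0 / num_random_labels
    return (random_label_accuracy - chance) / (1.0 - chance)


# With n = 2, an accuracy of 50% corresponds to no memorization.
print(memorization_score(0.5, 2))
# The same raw accuracy means far more memorization when there are 100 labels.
print(memorization_score(0.5, 100))
```

For $n=2$ the score reduces to $2 \cdot \text{accuracy} - 1$, which is the correlation between predictions and $\pm1$ labels that appears in the definition of Rademacher complexity.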

Theorems & Definitions (1)

  • Theorem 1