Benign Overfitting under Learning Rate Conditions for $α$ Sub-exponential Input

Kota Okudo; Kei Kobayashi

TL;DR

It is shown that under certain conditions on the dimensionality $p$ and the distance between the centers of the distributions, the misclassification error of the maximum margin classifier asymptotically approaches the noise level, the theoretical optimal value.

Abstract

This paper investigates the phenomenon of benign overfitting in binary classification problems with heavy-tailed input distributions, extending the analysis of maximum margin classifiers to $α$ sub-exponential distributions ($α\in (0, 2]$). This generalizes previous work focused on sub-gaussian inputs. We provide generalization error bounds for linear classifiers trained using gradient descent on unregularized logistic loss in this heavy-tailed setting. Our results show that, under certain conditions on the dimensionality $p$ and the distance between the centers of the distributions, the misclassification error of the maximum margin classifier asymptotically approaches the noise level, the theoretical optimal value. Moreover, we derive an upper bound on the learning rate $β$ for benign overfitting to occur and show that as the tail heaviness of the input distribution $α$ increases, the upper bound on the learning rate decreases. These results demonstrate that benign overfitting persists even in settings with heavier-tailed inputs than previously studied, contributing to a deeper understanding of the phenomenon in more realistic data environments.
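The training procedure named in the abstract, gradient descent with learning rate $β$ on the unregularized logistic loss of a linear classifier, can be illustrated with a minimal sketch. Everything specific below is an assumption made for illustration rather than the paper's experimental setup: the helper names `logistic_loss` and `logistic_gd`, the toy two-cluster data with Gaussian noise, the mean vector `mu`, and the values of `n`, `p`, `eta`, `beta`, and `steps`.

```python
import math
import random

def logistic_loss(w, X, y):
    """Unregularized logistic loss: sum_i log(1 + exp(-y_i * <w, x_i>))."""
    total = 0.0
    for xi, yi in zip(X, y):
        m = yi * sum(wj * xj for wj, xj in zip(w, xi))
        # Numerically stable evaluation of log(1 + exp(-m)).
        total += max(-m, 0.0) + math.log1p(math.exp(-abs(m)))
    return total

def logistic_gd(X, y, beta, steps):
    """Plain gradient descent from w = 0 with learning rate beta."""
    p = len(X[0])
    w = [0.0] * p
    for _ in range(steps):
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            m = yi * sum(wj * xj for wj, xj in zip(w, xi))
            # s = 1 / (1 + exp(m)), computed without overflow.
            if m >= 0:
                e = math.exp(-m)
                s = e / (1.0 + e)
            else:
                s = 1.0 / (1.0 + math.exp(m))
            for j in range(p):
                grad[j] -= yi * s * xi[j]
        w = [wj - beta * gj for wj, gj in zip(w, grad)]
    return w

# Toy overparameterized data (hypothetical values): p > n, labels flipped with probability eta.
rng = random.Random(0)
n, p, eta = 20, 50, 0.1
mu = [2.0 / math.sqrt(p)] * p  # cluster centers at +mu and -mu, with ||mu|| = 2
y_clean = [rng.choice([-1, 1]) for _ in range(n)]
X = [[yc * m + rng.gauss(0.0, 1.0) for m in mu] for yc in y_clean]
y = [-yc if rng.random() < eta else yc for yc in y_clean]

w = logistic_gd(X, y, beta=0.001, steps=500)
```

The learning rate in this sketch is deliberately small so that each step decreases the loss on the toy data; it is not derived from the paper's upper bound on $β$.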

Paper Structure (40 sections, 28 theorems, 118 equations, 6 figures, 3 tables)


Introduction
Related works
Preliminaries
Notation
$\alpha$ sub-exponential random variable
Data generation process
Maximum margin algorithm
Assumptions
Main results
Generalization bound
Learning rate
Sketch of proof of Theorem \ref{mainthm}
Simulation
Data generation
Model training
...and 25 more sections

Key Result

Theorem 4

For any $\alpha \in (0,2]$ and $\kappa\in(0,1)$, there exists a constant $c>0$ such that, under assumptions (A1)-(A5), for all large enough $C$, with probability at least $1-\delta$, the maximum margin classifier $w$ satisfies

Figures (6)

Figure 1: Boxplot of the estimated tail index $\xi$ for feature vector components extracted from the intermediate layers of a CNN with ReLU activation, trained on various datasets (CIFAR-10 \cite{krizhevsky2009learning}, CIFAR-100 \cite{krizhevsky2009learning}, Fashion-MNIST \cite{xiao2017fashion}, SVHN \cite{netzer2011reading}). The tail index $\xi$ represents the heaviness of the distribution tails, with smaller values indicating heavier tails. The Gaussian and Exponential distributions are included for comparison and were not passed through the CNN. The results indicate that the feature vectors for certain datasets have heavier-tailed distributions than the Gaussian distribution. Further details are given in Appendix \ref{appendix:intro_xi}.
Figure 2: Training and test errors versus dimension $p$ for a maximum margin classifier, with $n_\mathrm{train}=200$, $n_\mathrm{test}=1000$, and $p$ ranging from 100 to 1500. Data is generated from a heavy-tailed setting using a generalized normal distribution, as detailed in Section \ref{section:heavy_tailed_setting} and Appendix \ref{appendix:intro_benign}. The shape parameters are $\gamma = 0.25, 0.5, 2$, with variance normalized to 1. The noise level $\eta$ is $0.05$ (dotted line). Solid and dashed lines show training and test errors, respectively, with 95% confidence intervals as error bars over 50 trials. Training error remains near zero, while test error stabilizes around the noise level as $p$ increases.
Figure 3: A heatmap showing the mean test error for $\beta=0.001$ with the horizontal axis representing the dimension $p$ and the vertical axis representing the shape parameter $\gamma$.
Figure 4: A heatmap showing the mean test error for $\gamma=0.8$ with the horizontal axis representing the dimension $p$ and the vertical axis representing the learning rate $\beta$.
Figure 5: A heatmap showing the mean test error for $p=4000$ with the horizontal axis representing the shape parameter $\gamma$ and the vertical axis representing the learning rate $\beta$.
...and 1 more figures
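The Figure 2 caption describes inputs drawn from a generalized normal distribution with shape parameter $\gamma$ and variance normalized to 1. A minimal sketch of one way to draw such samples follows. It assumes the standard parameterization with density proportional to $\exp(-|x/s|^\gamma)$; the function name `sample_generalized_normal` and the gamma-transform sampling method are illustrative choices, not taken from the paper.

```python
import math
import random

def sample_generalized_normal(gamma, size, rng):
    """Draw unit-variance samples with density proportional to exp(-|x / s|^gamma)."""
    # If G ~ Gamma(shape=1/gamma, scale=1), then sign * G**(1/gamma) has
    # density proportional to exp(-|x|^gamma) and variance
    # Gamma(3/gamma) / Gamma(1/gamma); rescale so the variance equals 1.
    scale = math.sqrt(math.gamma(1.0 / gamma) / math.gamma(3.0 / gamma))
    samples = []
    for _ in range(size):
        g = rng.gammavariate(1.0 / gamma, 1.0)
        sign = 1.0 if rng.random() < 0.5 else -1.0
        samples.append(sign * scale * g ** (1.0 / gamma))
    return samples
```

With $\gamma = 2$ this reduces to a standard Gaussian, and smaller $\gamma$ gives heavier tails, so the sample variance converges more slowly for values such as $\gamma = 0.25$.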

Theorems & Definitions (45)

Definition 1: $\alpha$ sub-exponential random variable \cite{sambale2023some}
Example 2: Generalized normal distribution
Example 3: Generalized noisy rare-weak model
Theorem 4
Corollary 5
Proposition 6: A bound of the singular values of $X$
Corollary 7
Corollary 8
Corollary 9
Definition 10
...and 35 more
