Universality of Benign Overfitting in Binary Linear Classification

Ichiro Hashimoto; Stanislav Volgushev; Piotr Zwiernik

TL;DR

This work provides a comprehensive study of benign overfitting for linear maximum margin classifiers. It identifies a previously unknown phase transition in test error bounds for the noisy model and offers geometric intuition for it.

Abstract

The practical success of deep learning has led to the discovery of several surprising phenomena. One of these phenomena, which has spurred intense theoretical research, is "benign overfitting": deep neural networks seem to generalize well in the over-parametrized regime even though they fit noisy training data perfectly. It is now known that benign overfitting also occurs in various classical statistical models. For linear maximum margin classifiers, benign overfitting has been established theoretically in a class of mixture models with very strong assumptions on the covariate distribution. However, even in this simple setting, many questions remain open. For instance, most of the existing literature focuses on the noiseless case, where all true class labels are observed without errors, whereas the more interesting noisy case remains poorly understood. We provide a comprehensive study of benign overfitting for linear maximum margin classifiers. We discover a previously unknown phase transition in test error bounds for the noisy model and provide some geometric intuition behind it. We further considerably relax the required covariate assumptions in both the noisy and the noiseless case. Our results demonstrate that benign overfitting of maximum margin classifiers holds in a much wider range of scenarios than was previously known and provide new insights into the underlying mechanisms.

Paper Structure

This paper contains 37 sections, 45 theorems, 277 equations, 5 figures, and 1 table.

Introduction
Setting and Notations
Main Results
Test error bounds and benign overfitting in the noiseless case
Test error bounds and benign overfitting in the noisy model
Proof Sketch of Theorem \ref{thm:noiseless-main} and Theorem \ref{detail-noisy-main-1-simple}
Geometry behind Over-parametrization & Phase Transition
Blow up phenomenon
Phase Transition: Noiseless Model
Phase Transition: Noisy Model
Recovering existing results for sub-Gaussian mixture models
Overview of supplement
General results and proofs of results in model \ref{model:M}
Bounding various quantities related to the Gram matrix on events E1,...,E5
General lower and upper bounds on the test error in a special case
...and 22 more sections

Key Result

Theorem 2.1

When the dataset is linearly separable ($\exists \boldsymbol{w}\in \mathbb{R}^p$ such that $\langle \boldsymbol{w}, y_i\boldsymbol{x}_i\rangle >0$ for all $i$), a linear classifier optimized by gradient descent (eq:gdsc) with sufficiently small step size $a$ converges in direction to the maximum margi
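The directional convergence described in this result can be checked numerically on a toy problem. The following is a minimal sketch, not the paper's setup: it assumes the logistic loss (the loss behind eq:gdsc is not shown here), a hypothetical three-point dataset given directly as rows $y_i\boldsymbol{x}_i$, and a hand-computed maximum margin solution `w_hat` for that dataset. The names `Z`, `w_hat`, and `gd_logistic` are illustrative only.

```python
import numpy as np

# Hypothetical toy data: each row is y_i * x_i, so separability means Z @ w > 0.
Z = np.array([[1.0, 2.0],
              [2.0, -1.0],
              [2.0, 3.0]])

# Hand-computed hard-margin solution for this toy set (min ||w|| s.t. Z @ w >= 1).
# It equals 0.2 * Z[0] + 0.2 * Z[1]; the first two rows are the support vectors.
w_hat = np.array([0.6, 0.2])


def gd_logistic(Z, step=0.1, n_steps=20000):
    """Plain gradient descent on sum_i log(1 + exp(-<w, z_i>)), started at 0."""
    w = np.zeros(Z.shape[1])
    for _ in range(n_steps):
        margins = Z @ w
        # d/dw log(1 + exp(-m_i)) = -z_i / (1 + exp(m_i))
        grad = -(Z.T @ (1.0 / (1.0 + np.exp(margins))))
        w -= step * grad
    return w


w_gd = gd_logistic(Z)

# The norm of w_gd keeps growing with n_steps; only its direction stabilizes.
cosine = w_gd @ w_hat / (np.linalg.norm(w_gd) * np.linalg.norm(w_hat))
print("cosine(w_gd, w_hat) =", cosine)
```

The iterate itself does not converge, since its norm grows without bound on separable data, which is why the comparison is made through the cosine between `w_gd` and `w_hat` rather than through their distance.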

Figures (5)

Figure 1: The observations $\bar{\boldsymbol{x}}_i = y_i\boldsymbol{x}_i$ are concentrated near the spheres $\pm \boldsymbol{\mu}+\rho^{-1/2}S^{p-1}$. Except for the noiseless & strong signal case, a non-negligible proportion of the sphere $\boldsymbol{\mu} + \rho^{-1/2}S^{p-1}$ seems to lie outside of the shaded half-space.
Figure 2: Blow-up phenomenon. The observations $y_i\boldsymbol{x}_i$ are concentrated around the sphere $\boldsymbol{\mu}+\rho^{-1/2}S^{p-1}$. If $\tfrac{\langle\boldsymbol{\hat{w}}, \boldsymbol{\mu}\rangle}{\|\boldsymbol{\hat{w}}\|}$ is big enough, a large proportion of the blue sphere lies in the shaded half-space.
Figure 3: The observations $\bar{\boldsymbol{x}}_i=\boldsymbol{\mu}+\bar{\boldsymbol{z}}_i$ are concentrated on average around the sphere $\boldsymbol{\mu}+\rho^{-1/2}S^{p-1}$. Since the $\bar{\boldsymbol{z}}_i$'s are all nearly orthogonal to $\boldsymbol{\mu}$, the data points actually concentrate around a smaller area depicted in blue. Since $n \ll p$, they fill only a small subarea of this region and they all lie on the hyperplane $\langle\hat{\boldsymbol{w}},\boldsymbol{u}\rangle=1$. The decomposition $\frac{\hat{\boldsymbol{w}}}{\|\hat{\boldsymbol{w}}\|^2}=\boldsymbol{\mu}+\boldsymbol{z}_\perp$ depicted here is the fundamental geometric reason behind the phase transition.
Figure 4: Illustration of $\boldsymbol{z}_\perp$ as a convex combination of $\bar{\boldsymbol{z}}_i$'s or, equivalently, as an orthogonal projection of the origin on the maximum margin hyperplane defined by them.
Figure 5: The clean observations $\bar{\boldsymbol{x}}_i=\boldsymbol{\mu}+\bar{\boldsymbol{z}}_i$ are concentrated around the sphere $\boldsymbol{\mu}+\rho^{-1/2}S^{p-1}$ and the noisy observations concentrate around $-\boldsymbol{\mu}+\rho^{-1/2}S^{p-1}$. The decomposition $\frac{\hat{\boldsymbol{w}}}{\|\hat{\boldsymbol{w}}\|^2}=\nu_\textsc{c}(\boldsymbol{\mu}+\boldsymbol{z}_{\perp,\textsc{c}})+\nu_\textsc{n}(-\boldsymbol{\mu}+\boldsymbol{z}_{\perp,\textsc{n}})$ depicted here is the fundamental geometric reason behind the phase transition in the noisy model.
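
The two geometric facts the captions rely on, concentration of the noise vectors near a sphere and their near-orthogonality to $\boldsymbol{\mu}$ when $n \ll p$, can be illustrated with a short simulation. This is a sketch under assumptions that are not taken from the paper: the noise is standard Gaussian (the paper allows a much wider class), so the sphere radius is $\sqrt{p}$ rather than $\rho^{-1/2}$, and `mu`, `n`, and `p` are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 20, 10_000            # few samples, very high dimension (n << p)

# Hypothetical signal vector; only its direction matters for the cosine check.
mu = np.zeros(p)
mu[0] = 5.0

# Assumed noise model: i.i.d. standard Gaussian rows z_i in R^p.
noise = rng.standard_normal((n, p))

# Norms concentrate around sqrt(p): the z_i lie close to a sphere.
relative_norms = np.linalg.norm(noise, axis=1) / np.sqrt(p)

# Cosines between each z_i and mu are of order 1/sqrt(p): near-orthogonality.
cosines = (noise @ mu) / (np.linalg.norm(noise, axis=1) * np.linalg.norm(mu))

print("norm / sqrt(p): min", relative_norms.min(), "max", relative_norms.max())
print("max |cos(z_i, mu)|:", np.abs(cosines).max())
```

Under these assumptions the normalized norms stay within a few percent of 1 and the cosines are on the order of $1/\sqrt{p}$, which is the regime the figures depict.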

Theorems & Definitions (87)

Theorem 2.1: Theorem 3, soudry2022implicit
Lemma 3.1
Theorem 3.2
Theorem 3.3
Theorem 3.4
Theorem 3.5
Theorem 3.6
Theorem 3.7
Corollary 3.8
Theorem 3.9
...and 77 more
