Tractability from overparametrization: The example of the negative perceptron
Andrea Montanari, Yiqiao Zhong, Kangjie Zhou
TL;DR
This work analyzes a nonconvex negative-margin linear classifier in high dimensions under two data models: pure noise (random labels) and labels correlated with a linear signal. It introduces interpolation (δ_s) and algorithmic (δ_alg) thresholds in the proportional n/d regime, and uses second-moment methods and Gordon's Gaussian comparison to bound the threshold for existence of a solution, while a linear-programming surrogate yields a tractable algorithmic threshold δ_lin. The results reveal a gap between δ_s and δ_lin, show how the thresholds depend on κ and, in the linear-signal case, on an exponential-tail link φ, and connect these thresholds to the geometry of random polytopes through the polytope radius Rd. The paper further explores gradient-descent alternatives and reports numerical experiments, highlighting the potential for faster optimization in highly overparameterized regimes and motivating future work on sharper thresholds and algorithmic design. Altogether, the work provides a rigorous account of tractability from overparametrization in a simple nonconvex model, tying together learning, optimization, and high-dimensional geometry through concrete asymptotics and phase diagrams.
Abstract
In the negative perceptron problem we are given $n$ data points $({\boldsymbol x}_i,y_i)$, where ${\boldsymbol x}_i$ is a $d$-dimensional vector and $y_i\in\{+1,-1\}$ is a binary label. The data are not linearly separable, and hence we content ourselves with finding a linear classifier with the largest possible \emph{negative} margin. In other words, we want to find a unit norm vector ${\boldsymbol \theta}$ that maximizes $\min_{i\le n}y_i\langle {\boldsymbol \theta},{\boldsymbol x}_i\rangle$. This is a non-convex optimization problem (it is equivalent to finding a maximum norm vector in a polytope), and we study its typical properties under two random models for the data. We consider the proportional asymptotics in which $n,d\to \infty$ with $n/d\to\delta$, and prove upper and lower bounds on the maximum margin $\kappa_{\text{s}}(\delta)$ or, equivalently, on its inverse function $\delta_{\text{s}}(\kappa)$. That is, $\delta_{\text{s}}(\kappa)$ is the overparametrization threshold: for $n/d\le \delta_{\text{s}}(\kappa)-\varepsilon$ a classifier achieving vanishing training error exists with high probability, while for $n/d\ge \delta_{\text{s}}(\kappa)+\varepsilon$ it does not. Our bounds on $\delta_{\text{s}}(\kappa)$ match to the leading order as $\kappa\to -\infty$. We then analyze a linear programming algorithm to find a solution, and characterize the corresponding threshold $\delta_{\text{lin}}(\kappa)$. We observe a gap between the interpolation threshold $\delta_{\text{s}}(\kappa)$ and the linear programming threshold $\delta_{\text{lin}}(\kappa)$, raising the question of the behavior of other algorithms.
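To make the objective in the abstract concrete, the following minimal sketch evaluates the margin $\min_{i\le n}y_i\langle {\boldsymbol \theta},{\boldsymbol x}_i\rangle$ of a candidate direction. It is an illustration only, not the paper's algorithm: the function names `margin` and `sample_pure_noise` are hypothetical, and the standard Gaussian features used for the pure-noise model are an assumption made here for the example.

```python
import numpy as np


def margin(theta, X, y):
    """Margin min_i y_i <theta, x_i> of theta after rescaling it to unit norm."""
    theta = np.asarray(theta, dtype=float)
    theta = theta / np.linalg.norm(theta)  # the problem constrains theta to the unit sphere
    return float(np.min(y * (X @ theta)))


def sample_pure_noise(n, d, seed=0):
    """Draw n points in dimension d with labels independent of the features.

    Assumption for this sketch: i.i.d. standard Gaussian features and
    uniformly random +/-1 labels.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    y = rng.choice([-1.0, 1.0], size=n)
    return X, y


if __name__ == "__main__":
    n, d = 400, 100  # ratio n/d = 4
    X, y = sample_pure_noise(n, d)
    # A random direction is not a maximizer; it only gives a lower bound
    # on the best achievable margin for this data set.
    theta = np.random.default_rng(1).standard_normal(d)
    print("margin of a random unit vector:", margin(theta, X, y))
```

A unit vector is a solution at margin $\kappa$ exactly when `margin(theta, X, y) >= kappa`. For $\kappa<0$, any point of the polytope $\{{\boldsymbol \theta}: y_i\langle {\boldsymbol \theta},{\boldsymbol x}_i\rangle\ge\kappa \text{ for all } i\}$ with norm at least one can be rescaled to such a unit vector, which is the polytope reformulation mentioned in the abstract.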
