Tractability from overparametrization: The example of the negative perceptron

Andrea Montanari, Yiqiao Zhong, Kangjie Zhou

TL;DR

This work analyzes a nonconvex negative-margin linear classifier in high dimensions under two data models: pure noise (random labels) and labels correlated with a linear signal. In the proportional regime n/d → δ, it defines an interpolation threshold δ_s and an algorithmic threshold δ_alg, bounds the existence of solutions using second-moment methods and Gordon's Gaussian comparison inequality, and shows that a linear-programming surrogate yields a tractable algorithmic threshold δ_lin. The results reveal a gap between δ_s and δ_lin, show how the thresholds depend on κ and, in the linear-signal case, on an exponential-tail link φ, and connect them to the geometry of random polytopes through the radius R_d of a random polytope. The paper also explores gradient descent as an alternative and reports numerical experiments, which suggest that faster optimization is possible in highly overparametrized regimes and motivate future work on sharper thresholds and algorithm design. Overall, the work gives a rigorous account of tractability from overparametrization in a simple nonconvex model, linking learning, optimization, and high-dimensional geometry through concrete asymptotics and phase diagrams.

Abstract

In the negative perceptron problem we are given $n$ data points $(\boldsymbol{x}_i,y_i)$, where $\boldsymbol{x}_i$ is a $d$-dimensional vector and $y_i\in\{+1,-1\}$ is a binary label. The data are not linearly separable and hence we content ourselves with finding a linear classifier with the largest possible negative margin. In other words, we want to find a unit norm vector $\boldsymbol{\theta}$ that maximizes $\min_{i\le n}y_i\langle \boldsymbol{\theta},\boldsymbol{x}_i\rangle$. This is a non-convex optimization problem (it is equivalent to finding a maximum norm vector in a polytope), and we study its typical properties under two random models for the data. We consider the proportional asymptotics in which $n,d\to \infty$ with $n/d\to\delta$, and prove upper and lower bounds on the maximum margin $\kappa_{\text{s}}(\delta)$ or, equivalently, on its inverse function $\delta_{\text{s}}(\kappa)$. In other words, $\delta_{\text{s}}(\kappa)$ is the overparametrization threshold: for $n/d\le \delta_{\text{s}}(\kappa)-\varepsilon$ a classifier achieving vanishing training error exists with high probability, while for $n/d\ge \delta_{\text{s}}(\kappa)+\varepsilon$ it does not. Our bounds on $\delta_{\text{s}}(\kappa)$ match to the leading order as $\kappa\to -\infty$. We then analyze a linear programming algorithm to find a solution, and characterize the corresponding threshold $\delta_{\text{lin}}(\kappa)$. We observe a gap between the interpolation threshold $\delta_{\text{s}}(\kappa)$ and the linear programming threshold $\delta_{\text{lin}}(\kappa)$, raising the question of the behavior of other algorithms.
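The equivalence with a maximum-norm problem mentioned in the abstract rests on the margin being positively homogeneous in the candidate vector: if a vector of norm at least 1 satisfies $y_i\langle \theta, x_i\rangle \ge \kappa$ for all $i$ with $\kappa < 0$, then dividing it by its norm moves every constraint value toward zero, so the unit vector still satisfies the constraints. The sketch below illustrates this numerically. It is not taken from the paper: the toy data (Gaussian features, uniformly random labels) and the helper names `margin` and `rescale_to_unit` are assumptions made for illustration.

```python
import math
import random

def margin(theta, xs, ys):
    """Return min_i y_i * <theta, x_i> for a candidate direction theta."""
    return min(y * sum(t * v for t, v in zip(theta, x)) for x, y in zip(xs, ys))

def rescale_to_unit(theta):
    """Divide theta by its Euclidean norm."""
    norm = math.sqrt(sum(t * t for t in theta))
    return [t / norm for t in theta]

# Assumed toy data (not from the paper): Gaussian features, random labels.
rng = random.Random(0)
n, d = 30, 10
xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
ys = [rng.choice((-1, 1)) for _ in range(n)]

# A vector of norm 3 lies in the polytope {theta : y_i <theta, x_i> >= kappa}
# when kappa is set to that vector's own margin.
unit = rescale_to_unit([rng.gauss(0.0, 1.0) for _ in range(d)])
big = [3.0 * t for t in unit]
kappa = margin(big, xs, ys)

# Rescaling to unit norm divides the margin by the norm (here, by 3).
# When kappa is negative, this can only bring the margin closer to zero.
print("margin at norm 3:", kappa)
print("margin at norm 1:", margin(unit, xs, ys))
```

With random labels the margin of a random direction is negative with overwhelming probability, so the second printed value is typically one third of the first and therefore larger.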

Paper Structure

This paper contains 41 sections, 46 theorems, 539 equations, 6 figures.

Key Result

Theorem 4.1

If $\delta < \delta_{\mathrm{lb}}(\kappa)$, then with high probability there is a $\kappa$-margin solution. Moreover, $\delta_{\mathrm{lb}}(\kappa)= (1+o_{\kappa}(1))(\log |\kappa|)/\Phi(\kappa)$ as $\kappa \to -\infty$. Hence, for any $\varepsilon > 0$, there exists a $\underline{\kappa} = \underline{\kappa}(\varepsilon) < 0$ such that for all $\kappa < \underline{\kappa}$, a $\kappa$-margin solution exists with high probability whenever $\delta < (1-\varepsilon)(\log |\kappa|)/\Phi(\kappa)$.
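To get a feel for the scale of the leading-order expression $(\log |\kappa|)/\Phi(\kappa)$ in Theorem 4.1, the sketch below evaluates it for a few values of $\kappa$. Two assumptions are made here that this summary does not state: $\Phi$ is taken to be the standard Gaussian CDF, and the $(1+o_{\kappa}(1))$ correction is dropped, so the numbers are indicative only and are not the lower bound itself. The function names are hypothetical.

```python
import math

def std_normal_cdf(x):
    """Standard Gaussian CDF (assumed meaning of Phi), computed via erfc."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def leading_order_threshold(kappa):
    """Evaluate log|kappa| / Phi(kappa), ignoring the (1 + o(1)) factor."""
    if kappa >= -1.0:
        # log|kappa| is not positive here; the asymptotic concerns kappa -> -infinity.
        raise ValueError("expected kappa < -1")
    return math.log(abs(kappa)) / std_normal_cdf(kappa)

for kappa in (-2.0, -3.0, -4.0):
    print(kappa, leading_order_threshold(kappa))
```

Under these assumptions the expression grows very quickly as $\kappa$ decreases, because $\Phi(\kappa)$ decays like a Gaussian tail while $\log |\kappa|$ grows slowly.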

Figures (6)

  • Figure 1: Phase diagram for the negative perceptron. The 'replica symmetric' prediction $\delta_{\mathrm{RS}}(\kappa)$ coincides with the satisfiability threshold $\delta_{\mathrm{s}}(\kappa)$ for $\kappa \ge 0$ but is only an upper bound for $\kappa < 0$. Our improved upper bound $\delta_{\mathrm{ub}}(\kappa)$ is strictly better than $\delta_{\mathrm{RS}}(\kappa)$ for all negative values of $\kappa$. The lower bound $\delta_{\mathrm{lb}}(\kappa)$ is inferior to the linear programming threshold $\delta_{\mathrm{lin}}(\kappa)$ (which is a lower bound on $\delta_{\mathrm{alg}}(\kappa)$; see its precise definition in Eq. \ref{cond:delta1}) when $\vert \kappa \vert$ is small. As $\kappa$ decreases, $\delta_{\mathrm{lb}}(\kappa)$ surpasses $\delta_{\mathrm{lin}}(\kappa)$. Finally, $\delta_{\mathrm{lb}}(\kappa)$, $\delta_{\mathrm{s}}(\kappa)$ and $\delta_{\mathrm{ub}}(\kappa)$ become asymptotically equivalent as $\kappa \to -\infty$. The phase transition for the existence of a $\kappa$-margin solution occurs in the region delimited by $\max \{\delta_{\mathrm{lb}}(\kappa),\delta_{\mathrm{lin}}(\kappa) \}$ and $\delta_{\mathrm{ub}}(\kappa)$ (gray area).
  • Figure 2: Theoretical predictions of the phase transition thresholds and the linear programming lower bound for labels correlated with a linear signal. Here the link function is logistic: $\varphi(t) = (1+e^{-t})^{-1}$. The phase transition for the existence of a $\kappa$-margin solution occurs in the region delimited by $\delta_{\mathrm{lb}}(\kappa;\varphi)$ and $\delta_{\mathrm{ub}}(\kappa;\varphi)$ (gray area).
  • Figure 3: Maximizing the function $M(\rho,r)$ as defined in \ref{def:Mrho} over $\rho \in [-1,1]$, $r \in [0,1]$ numerically gives the maximizer $(\rho^*, r^*)$. The heatmaps show the values of $r^*$ (left) and $\rho^*$ (right) under varying $\kappa$ and $\delta$. Left: The yellow region indicates the regime where a $\kappa$-margin solution exists. Right: $\rho^*$ gives the asymptotic correlation $\langle \widehat{\boldsymbol{\theta}}, \boldsymbol{\theta}_* \rangle$.
  • Figure 4: Asymptotic predictions of the estimation error among $\kappa$-margin solutions $\boldsymbol{\theta}\in{\rm ERM}_0(\kappa)$, as a function of $\kappa$ for $\delta = 1.5^{15}$. Here we use the logistic link function $\varphi(t)=(1+e^{-t})^{-1}$. The estimation error of any $\kappa$-margin solution lies in the gray region. The linear programming algorithm of Section \ref{sec:AlgoNRLabels} achieves the estimation error shown as the blue curve (namely $\sqrt{2(1 - \rho^*)}$ for $\rho^*$ introduced in Theorem \ref{thm:err} (a)). Vertical dashed lines correspond to $\kappa_{\mathrm{lin}}(\delta) :=\sup\{\kappa:\; \delta<\delta_{\mathrm{lin}}(\kappa; \varphi)\}$ (left vertical line) and $\kappa_{\mathrm{ub}}(\delta) :=\sup\{\kappa:\; \delta<\delta_{\mathrm{ub}}(\kappa; \varphi)\}$ (right vertical line).
  • Figure 5: Scatter plots of the empirical probability of finding a $\kappa$-margin interpolator using gradient descent (GD) and linear programming (LP), as a function of $\delta$. We fix $\kappa = -1.5$ and choose $d \in \{50, 100, 200\}$. The vertical solid lines represent $\delta_{\mathrm{lb}}(\kappa)$, $\delta_{\mathrm{lin}}(\kappa)$ and $\delta_{\mathrm{ub}}(\kappa)$ (from left to right) respectively. Note that for $\kappa = -1.5$, we actually have $\delta_{\mathrm{lb}}(\kappa) < \delta_{\mathrm{lin}}(\kappa)$.
  • ...and 1 more figure
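The kind of experiment described in the Figure 5 caption (empirical probability of finding a $\kappa$-margin solution as a function of $\delta = n/d$) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' procedure: the data are taken to be isotropic Gaussian features with uniformly random labels, and the search routine `try_find_solution` is a hypothetical perceptron-style heuristic on the unit sphere, used in place of the gradient descent and linear programming algorithms studied in the paper. All function names and parameter defaults are made up for this example.

```python
import math
import random

def normalize(vec):
    """Project a nonzero vector onto the unit sphere."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def try_find_solution(xs, ys, kappa, rng, max_steps=500, lr=0.1):
    """Perceptron-style heuristic (an assumption, not the paper's GD or LP).

    Repeatedly nudges theta toward the currently worst-classified point
    and renormalizes; returns True once min_i y_i <theta, x_i> >= kappa.
    """
    d = len(xs[0])
    theta = normalize([rng.gauss(0.0, 1.0) for _ in range(d)])
    for _ in range(max_steps):
        worst_val, worst_i = min(
            (y * sum(t * v for t, v in zip(theta, x)), i)
            for i, (x, y) in enumerate(zip(xs, ys))
        )
        if worst_val >= kappa:
            return True
        x, y = xs[worst_i], ys[worst_i]
        theta = normalize([t + lr * y * v for t, v in zip(theta, x)])
    return False

def empirical_success(delta, d, kappa, trials, seed=0):
    """Fraction of random instances with n = round(delta * d) that are solved."""
    rng = random.Random(seed)
    n = max(1, round(delta * d))
    hits = 0
    for _ in range(trials):
        # Assumed pure-noise model: Gaussian features, random labels.
        xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
        ys = [rng.choice((-1, 1)) for _ in range(n)]
        hits += try_find_solution(xs, ys, kappa, rng)
    return hits / trials

for delta in (0.5, 1.0, 2.0):
    print(delta, empirical_success(delta, d=20, kappa=-1.5, trials=5))
```

The dimensions and trial counts here are kept far smaller than those in the caption so that the pure-Python loop runs quickly; the resulting frequencies say nothing about the thresholds reported in the paper.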

Theorems & Definitions (94)

  • Remark 1.1
  • Definition 1
  • Theorem 4.1
  • Definition 2
  • Theorem 4.2
  • Theorem 4.3
  • Remark 4.1
  • Remark 5.1
  • Definition 3
  • Theorem 5.1
  • ...and 84 more