Universality of max-margin classifiers

Andrea Montanari; Feng Ruan; Basil Saeed; Youngtak Sohn

TL;DR

This work establishes a universality principle for max-margin classification in high dimensions, covering both nonlinear randomized feature maps and features with independent non-Gaussian entries. By matching each non-Gaussian feature model to a Gaussian model with the same second moments (the feature covariance and the feature-label covariance), the authors reduce the analysis to the Gaussian setting and show that margins and test errors have the same asymptotic behavior in both models. A duality-based strategy shows that the max-margin classifier can be written as an average over support vectors, whose number grows linearly with the sample size, enabling transfer of Gaussian results via delocalization and ERM-like reformulations. The results imply that overparameterization thresholds and generalization behavior, including benign overfitting, are governed by the same covariance structure across a broad class of feature maps. This provides a rigorous bridge between non-Gaussian randomized features and classical Gaussian analyses of high-dimensional classification.

Abstract

Maximum margin binary classification is one of the most fundamental algorithms in machine learning, yet the role of featurization maps and the high-dimensional asymptotics of the misclassification error for non-Gaussian features are still poorly understood. We consider settings in which we observe binary labels $y_i$ and either $d$-dimensional covariates ${\boldsymbol z}_i$ that are mapped to a $p$-dimensional space via a randomized featurization map ${\boldsymbol \phi}:\mathbb{R}^d \to\mathbb{R}^p$, or $p$-dimensional features with independent non-Gaussian entries. In this context, we study two fundamental questions: $(i)$ At what overparametrization ratio $p/n$ do the data become linearly separable? $(ii)$ What is the generalization error of the max-margin classifier? Working in the high-dimensional regime in which the number of features $p$, the number of samples $n$ and the input dimension $d$ (in the nonlinear featurization setting) diverge, with ratios of order one, we prove a universality result establishing that the asymptotic behavior is completely determined by the expected covariance of feature vectors and by the covariance between features and labels. In particular, the overparametrization threshold and generalization error can be computed within a simpler Gaussian model. The main technical challenge lies in the fact that max-margin is not the maximizer (or minimizer) of an empirical average, but the maximizer of a minimum over the samples. We address this by representing the classifier as an average over support vectors. Crucially, we find that in high dimensions, the support vector count is proportional to the number of samples, which ultimately yields universality.
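
The distinction drawn in the last sentences of the abstract, between maximizing an empirical average and maximizing a minimum over samples, can be written out using the textbook formulation of the max-margin problem. The sketch below is standard hard-margin SVM theory rather than a statement taken from the paper; the symbols ${\boldsymbol x}_i$ (feature vectors), ${\boldsymbol \theta}$ (weight vector), $\kappa_n$ (margin), $\alpha_i$ (dual coefficients) and $S$ (support set) are assumed here for illustration and may differ from the authors' notation.

```latex
% Max-margin: a maximum over directions of a MINIMUM over samples,
% not the optimizer of an empirical average (1/n) \sum_i \ell(...).
\kappa_n \;=\; \max_{\|{\boldsymbol \theta}\|_2 \le 1}\; \min_{1 \le i \le n}\;
  y_i \langle {\boldsymbol x}_i, {\boldsymbol \theta} \rangle .

% Since {\boldsymbol \theta} = 0 is feasible, \kappa_n \ge 0 always;
% the data are (strictly) linearly separable if and only if \kappa_n > 0.

% When \kappa_n > 0, the KKT conditions express the unit-norm maximizer
% as a nonnegative combination of the samples attaining the margin:
\hat{\boldsymbol \theta} \;\propto\; \sum_{i=1}^{n} \alpha_i \, y_i \, {\boldsymbol x}_i ,
\qquad \alpha_i \ge 0, \qquad \alpha_i = 0 \ \text{for } i \notin S,
\qquad S = \bigl\{ i : y_i \langle {\boldsymbol x}_i, \hat{\boldsymbol \theta} \rangle = \kappa_n \bigr\} .
```

Read this way, the abstract's claim that the support vector count is proportional to $n$ says that $|S|$ grows linearly in $n$, so the sum above behaves like an average over a constant fraction of the samples.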

Paper Structure (66 sections, 48 theorems, 289 equations)

Introduction
Motivation and overview of results
Technical innovation
Further related work
Main results
Random features model and its Gaussian equivalent
Independent features model and its Gaussian equivalent
Universality theorems
Implications for benign overfitting
Proofs of main results
Proof of Theorem 1: Universality of Max-Margin
The max-margin as an average over support vectors
Relation to empirical risk minimization
Universality of the margin
Upper bound.
...and 51 more sections

Key Result

Theorem 1

Consider either $({\boldsymbol X},{\boldsymbol G})=({\boldsymbol X}_{{\sf RF}},{\boldsymbol G}_{{\sf RF}})$ under Assumption assumption:RF or $({\boldsymbol X},{\boldsymbol G})=({\boldsymbol X}_{\sf ind},{\boldsymbol G}_{\sf ind})$ under Assumption assumption:ind. Then, for any Lipschitz function $\

Theorems & Definitions (82)

Remark 3.1
Remark 3.2
Remark 3.3
Theorem 1: Universality of the margin
Theorem 2: Universality of the test error
Corollary 1
Lemma 1: Restricted Strong Convexity
Proposition 1
Proposition 2
Remark 4.1
...and 72 more
