
Achieving Margin Maximization Exponentially Fast via Progressive Norm Rescaling

Mingze Wang, Zeping Min, Lei Wu

TL;DR

This work analyzes why gradient-based methods converge to max-margin solutions at slow rates and identifies centripetal velocity as the bottleneck. It introduces Progressive Rescaling Gradient Descent (PRGD), which alternates norm-rescaling and projected normalized gradient steps to stay in regions with favorable velocity, yielding exponential convergence in both directional alignment and margin. The theoretical results show PRGD dramatically outperforms GD and NGD under mild data-distribution assumptions, and empirical tests on synthetic, tabular, and deep-network settings corroborate improved margin attainment and generalization. The approach offers a practical route to faster implicit-margin optimization and suggests potential gains by combining PRGD with existing regularization techniques on real-world models.

Abstract

In this work, we investigate the margin-maximization bias exhibited by gradient-based algorithms in classifying linearly separable data. We present an in-depth analysis of the specific properties of the velocity field associated with (normalized) gradients, focusing on their role in margin maximization. Inspired by this analysis, we propose a novel algorithm called Progressive Rescaling Gradient Descent (PRGD) and show that PRGD can maximize the margin at an exponential rate. This stands in stark contrast to all existing algorithms, which maximize the margin at a slow polynomial rate. Specifically, we identify mild conditions on data distribution under which existing algorithms such as gradient descent (GD) and normalized gradient descent (NGD) provably fail in maximizing the margin efficiently. To validate our theoretical findings, we present both synthetic and real-world experiments. Notably, PRGD also shows promise in enhancing the generalization performance when applied to linearly non-separable datasets and deep neural networks.
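
The alternation described in the TL;DR (rescale the norm, then take projected normalized gradient steps) can be sketched in a few lines. This is a minimal illustration and not the paper's Algorithm 1: the exponential loss, the normalization by the gradient norm, the geometric radius schedule (`R0`, `growth`), the fixed `period`, the function names, and the toy two-point dataset are all assumptions made here for concreteness.

```python
import numpy as np

def normalized_margin(w, X, y):
    """Smallest signed distance of any sample to the hyperplane defined by w."""
    return np.min(y * (X @ w)) / np.linalg.norm(w)

def unit_descent_direction(w, X, y):
    """Unit vector along -grad L(w) for the exponential loss (assumed loss)."""
    m = y * (X @ w)                      # per-sample margins y_i <x_i, w>
    p = np.exp(-(m - m.min()))           # shifted weights avoid underflow to all zeros
    g = (p[:, None] * y[:, None] * X).sum(axis=0)
    return g / np.linalg.norm(g)

def ngd(w, X, y, eta, steps):
    """Normalized gradient descent with a constant step size."""
    for _ in range(steps):
        w = w + eta * unit_descent_direction(w, X, y)
    return w

def prgd_sketch(w, X, y, eta, steps, period=5, R0=1.0, growth=2.0):
    """Hypothetical PRGD-style loop: periodic norm rescaling plus projected NGD steps."""
    radius = R0
    for t in range(steps):
        if t % period == 0:
            if t > 0:
                radius *= growth                      # assumed geometric radius schedule
            w = radius * w / np.linalg.norm(w)        # rescaling: keep direction, set norm
        v = w + eta * unit_descent_direction(w, X, y)
        w = v * min(1.0, radius / np.linalg.norm(v))  # project onto the ball of that radius
    return w

# Toy separable data (an assumption for illustration, not a dataset from the paper).
X = np.array([[2.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 1.0])
w1 = ngd(np.zeros(2), X, y, eta=0.1, steps=1)         # warm start: one NGD step from zero
w_ngd = ngd(w1, X, y, eta=0.1, steps=20)
w_prgd = prgd_sketch(w1, X, y, eta=0.1, steps=20)
print(normalized_margin(w_ngd, X, y), normalized_margin(w_prgd, X, y))
```

The sketch only shows the control flow. Whether a given schedule attains the exponential rate depends on the choice of radii and rescaling times analyzed in the paper.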

Paper Structure

This paper contains 25 sections, 16 theorems, 141 equations, 6 figures, 2 tables, 1 algorithm.

Key Result

Proposition 4.1

Consider the dataset visualized in Figure 1(a). Then NGD can only maximize the margin polynomially fast, while PRGD (Algorithm 1) can maximize the margin exponentially fast. Specifically, ...

Figures (6)

  • Figure 1: (a) A visualization of the dataset from Proposition 4.1, where $\boldsymbol{w}^\star$ is the max-margin solution. (b) The vector field and the trajectories of NGD and PRGD for the same dataset. The gray arrows plot the vector field $-\nabla\mathcal{L}(\cdot)/\left\| \nabla\mathcal{L}(\cdot) \right\|$; the red dashed line corresponds to the max-margin solution $\boldsymbol{w}^\star$; the green zone $\mathbb{A}$ is an "attractor" of the NGD dynamics. We plot the trajectories of PRGD and NGD for 8 iterations starting from the same initial point $\boldsymbol{w}(1)$ (black), where $\boldsymbol{w}(1)$ is obtained by running NGD from $\boldsymbol{w}(0)=\mathbf{0}$ (black).
  • Figure 2: A visual illustration of Definition 5.2 (Centripetal Velocity) in $\mathbb{R}^3$. The red arrow corresponds to the max-margin direction $\boldsymbol{w}^\star$. At $\boldsymbol{w}\in\mathbb{R}^3$, the purple arrow signifies the normalized negative gradient; the orange arrow depicts the projection of $-\nabla\mathcal{L}(\boldsymbol{w})/\mathcal{L}(\boldsymbol{w})$ along the centripetal direction $-\mathcal{P}_{\perp}(\boldsymbol{w})/\left\| \mathcal{P}_{\perp}(\boldsymbol{w}) \right\|$, reflecting the centripetal velocity $\varphi(\boldsymbol{w})$.
  • Figure 3: Comparison of margin maximization rates of different algorithms on a synthetic dataset. (Left) A visualization of the 2D synthetic dataset. The yellow points represent the data with label $1$, while the purple points correspond to the data with label $-1$. (Middle, right) Comparison of the margin maximization rates of different algorithms on this dataset at small and large time scales, respectively.
  • Figure 4: Comparison of margin maximization rates of different algorithms on digit (real-world) datasets. (Left) Results on the digit-01 dataset; (right) results on the digit-04 dataset.
  • Figure 5: Comparison of the generalization performance of GD, NGD, and PRGD for linearly non-separable datasets and deep neural networks.
  • ...and 1 more figure
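
The quantity described in the Figure 2 caption, the projection of $-\nabla\mathcal{L}(\boldsymbol{w})/\mathcal{L}(\boldsymbol{w})$ onto the centripetal direction, can be evaluated numerically. The sketch below rests on assumptions not stated in this summary: the exponential loss, $\mathcal{P}_{\perp}(\boldsymbol{w})$ read as the component of $\boldsymbol{w}$ orthogonal to $\boldsymbol{w}^\star$, and a hypothetical two-point dataset. The function name `centripetal_velocity` is illustrative.

```python
import numpy as np

def centripetal_velocity(w, w_star, X, y):
    """<-grad L(w)/L(w), -P_perp(w)/||P_perp(w)||> for the exponential loss (assumed)."""
    w_star = w_star / np.linalg.norm(w_star)
    perp = w - (w @ w_star) * w_star        # assumed P_perp: part of w orthogonal to w_star
    m = y * (X @ w)                         # per-sample margins
    p = np.exp(-(m - m.min()))
    p /= p.sum()                            # p_i = exp(-m_i) / sum_j exp(-m_j)
    neg_grad_over_loss = (p[:, None] * y[:, None] * X).sum(axis=0)  # equals -grad L / L
    return neg_grad_over_loss @ (-perp / np.linalg.norm(perp))

# Hypothetical toy data; its max-margin direction is proportional to (1, 2).
X = np.array([[2.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 1.0])
w_star = np.array([1.0, 2.0])
phi_a = centripetal_velocity(np.array([3.0, 0.0]), w_star, X, y)
phi_b = centripetal_velocity(np.array([0.0, 3.0]), w_star, X, y)
print(phi_a, phi_b)
```

Under these assumptions, a positive value means the normalized negative gradient has a component pointing back toward the axis spanned by $\boldsymbol{w}^\star$. The function is undefined when $\boldsymbol{w}$ lies exactly on that axis, since the orthogonal component is then zero.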

Theorems & Definitions (31)

  • Proposition 4.1
  • Definition 5.1
  • Definition 5.2: Centripetal Velocity
  • Definition 5.3: Semi-infinite Hollow Cylinder
  • Theorem 5.5: Centripetal Velocity Analysis, Main result
  • Theorem 6.1: PRGD, Main Result
  • Remark 6.2
  • Remark 6.3
  • Theorem 6.4: GD and NGD, Main Results
  • Proof of Proposition 4.1
  • ...and 21 more