Achieving Margin Maximization Exponentially Fast via Progressive Norm Rescaling
Mingze Wang, Zeping Min, Lei Wu
TL;DR
This work analyzes why gradient-based methods converge to max-margin solutions slowly and identifies centripetal velocity as the bottleneck. It introduces Progressive Rescaling Gradient Descent (PRGD), which alternates norm rescaling with projected normalized gradient steps to stay in regions with favorable velocity, yielding exponential convergence in both directional alignment and margin. The theory shows that, under mild data-distribution assumptions, gradient descent (GD) and normalized gradient descent (NGD) are limited to polynomial rates while PRGD attains an exponential one. Experiments on synthetic, tabular, and deep-network settings corroborate improved margin attainment and generalization. The approach offers a practical route to faster margin maximization and suggests potential gains from combining PRGD with existing regularization techniques on real-world models.
Abstract
In this work, we investigate the margin-maximization bias exhibited by gradient-based algorithms in classifying linearly separable data. We present an in-depth analysis of the specific properties of the velocity field associated with (normalized) gradients, focusing on their role in margin maximization. Inspired by this analysis, we propose a novel algorithm called Progressive Rescaling Gradient Descent (PRGD) and show that PRGD can maximize the margin at an exponential rate. This stands in stark contrast to all existing algorithms, which maximize the margin at a slow polynomial rate. Specifically, we identify mild conditions on the data distribution under which existing algorithms such as gradient descent (GD) and normalized gradient descent (NGD) provably fail to maximize the margin efficiently. To validate our theoretical findings, we present both synthetic and real-world experiments. Notably, PRGD also shows promise in enhancing generalization performance when applied to linearly non-separable datasets and deep neural networks.
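To make the alternation of norm rescaling and projected normalized gradient steps concrete, the following is a minimal NumPy sketch for a linear classifier, not the authors' reference implementation. Several details are assumptions made only for illustration: the logistic loss, normalizing the gradient by the loss value, the geometric radius schedule (`r0`, `growth`), the fixed number of steps per phase, and the initialization. The function names (`prgd_sketch`, `normalized_margin`, `logistic_loss_and_grad`) are hypothetical.

```python
import numpy as np


def logistic_loss_and_grad(w, X, y):
    """Mean logistic loss and its gradient, for labels y in {-1, +1}."""
    z = y * (X @ w)                          # per-sample margins y_i <w, x_i>
    loss = np.mean(np.logaddexp(0.0, -z))    # log(1 + exp(-z)), computed stably
    s = np.exp(-np.logaddexp(0.0, z))        # 1 / (1 + exp(z)), computed stably
    grad = -(X * (y * s)[:, None]).mean(axis=0)
    return loss, grad


def normalized_margin(w, X, y):
    """Normalized margin: min_i y_i <w, x_i> / ||w||."""
    return np.min(y * (X @ w)) / np.linalg.norm(w)


def prgd_sketch(X, y, n_phases=10, steps_per_phase=50, lr=0.1, r0=1.0, growth=1.5):
    """Alternate norm rescaling with projected normalized gradient steps.

    Assumptions (not taken from the paper): geometric radius schedule,
    gradient normalized by the loss value, and the initialization below.
    """
    w = X.T @ y / len(y)                     # assumed init: mean of y_i * x_i
    radius = r0
    for _ in range(n_phases):
        # Rescaling step: keep the direction, set the norm to the current radius.
        w = radius * w / np.linalg.norm(w)
        for _ in range(steps_per_phase):
            loss, grad = logistic_loss_and_grad(w, X, y)
            v = w - lr * grad / loss         # normalized gradient step
            norm_v = np.linalg.norm(v)
            if norm_v > radius:              # project back onto the ball of this radius
                v = radius * v / norm_v
            w = v
        radius *= growth                     # enlarge the radius for the next phase
    return w


# Small linearly separable toy dataset (made up for this example).
X_demo = np.array([[2.0, 0.5], [1.0, 3.0], [-1.0, -1.0], [-3.0, 0.5]])
y_demo = np.array([1.0, 1.0, -1.0, -1.0])

w_hat = prgd_sketch(X_demo, y_demo)
print("direction:", w_hat / np.linalg.norm(w_hat))
print("normalized margin:", normalized_margin(w_hat, X_demo, y_demo))
```

The sketch keeps the radius modest (about 38 with the defaults) so that the loss stays far from floating-point underflow. A larger schedule would need a log-domain treatment of the `grad / loss` ratio.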
