Towards The Implicit Bias on Multiclass Separable Data Under Norm Constraints

Shengping Xie, Zekun Wu, Quan Chen, Kaixu Tang

Abstract

Implicit bias induced by gradient-based algorithms is essential to the generalization of overparameterized models, yet its mechanisms can be subtle. This work leverages the Normalized Steepest Descent (NSD) framework to investigate how optimization geometry shapes solutions on multiclass separable data. We introduce NucGD, a geometry-aware optimizer designed to enforce low-rank structure through nuclear norm constraints. Beyond the algorithm itself, we connect NucGD with emerging low-rank projection methods, providing a unified perspective. To enable scalable training, we derive an efficient SVD-free update rule via asynchronous power iteration. Furthermore, we empirically dissect the impact of stochastic optimization dynamics, characterizing how the gradient noise induced by mini-batch sampling and momentum modulates convergence toward the expected maximum-margin solutions. Our code is available at: https://github.com/Tsokarsic/observing-the-implicit-bias-on-multiclass-seperable-data.
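As an illustration of why an SVD-free update is possible, the sketch below approximates the leading singular pair of a gradient matrix by plain power iteration. It rests on a standard fact: the unit-nuclear-norm matrix with the largest inner product against `G` is the rank-one matrix `u1 v1^T` built from the leading singular vectors of `G`. The function names, the `v0` warm-start argument, and the iteration count are hypothetical choices made for this example. The asynchronous variant mentioned in the abstract is not described here, so this should not be read as the authors' update rule.

```python
import numpy as np


def top_singular_pair(G, v0=None, num_iters=50, seed=0):
    """Approximate the leading singular triple (u, sigma, v) of a nonzero matrix G.

    Plain power iteration on G^T G; v0 is an optional warm start (an assumption
    of this sketch, e.g. the right vector kept from a previous optimizer step).
    """
    if v0 is None:
        v = np.random.default_rng(seed).standard_normal(G.shape[1])
    else:
        v = np.array(v0, dtype=float)
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        u = G @ v
        u /= np.linalg.norm(u)
        v = G.T @ u
        v /= np.linalg.norm(v)
    # Final left vector and singular value estimate from the converged v.
    u = G @ v
    sigma = np.linalg.norm(u)
    return u / sigma, float(sigma), v


def nuclear_steepest_direction(G, **kwargs):
    """Rank-one matrix D with unit nuclear norm that maximizes <G, D>."""
    u, sigma, v = top_singular_pair(G, **kwargs)
    return np.outer(u, v), sigma
```

A steepest-descent step under the nuclear norm would then subtract a multiple of the returned rank-one matrix from the weights, which is one way to see how such a geometry can favor low-rank solutions.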

Paper Structure

This paper contains 15 sections, 4 theorems, 21 equations, 8 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

For the multiclass linear model trained with the paper's loss function on separable data, under basic assumptions, when the step size is of order $\sqrt{1/t}$, the relative margin of the weight matrix along the NSD iterates converges to the maximum relative margin.
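To make the setting of the theorem concrete, here is a minimal sketch of NSD under the nuclear norm on a small synthetic separable problem. It uses a step size proportional to $1/\sqrt{t}$ and measures the relative margin as the smallest logit gap divided by the nuclear norm of the weights. Everything specific is an assumption made for illustration: the synthetic data, the softmax cross-entropy loss, the constant `eta0`, and all function names. It also uses an exact SVD for clarity rather than the SVD-free rule from the abstract.

```python
import numpy as np


def make_separable_data(k=3, d=5, n_per_class=10, seed=0):
    """Synthetic, linearly separable clusters (hypothetical data for this sketch)."""
    rng = np.random.default_rng(seed)
    y = np.repeat(np.arange(k), n_per_class)
    means = 3.0 * np.eye(k, d)  # one well-separated mean per class
    X = means[y] + 0.1 * rng.standard_normal((len(y), d))
    return X, y


def cross_entropy_grad(W, X, y):
    """Gradient of the mean softmax cross-entropy for logits X @ W.T, W of shape (k, d)."""
    logits = X @ W.T
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    P[np.arange(len(y)), y] -= 1.0
    return P.T @ X / len(y)


def min_margin(W, X, y):
    """Smallest gap between the true-class logit and the best competing logit."""
    logits = X @ W.T
    idx = np.arange(len(y))
    true = logits[idx, y].copy()
    logits[idx, y] = -np.inf
    return float((true - logits.max(axis=1)).min())


def relative_margin(W, X, y):
    """Minimum margin normalized by the nuclear norm of W."""
    return min_margin(W, X, y) / np.linalg.norm(W, ord="nuc")


def nsd_nuclear(X, y, k, steps=1000, eta0=0.5):
    """Normalized steepest descent w.r.t. the nuclear norm with step eta0 / sqrt(t)."""
    W = np.zeros((k, X.shape[1]))
    for t in range(1, steps + 1):
        G = cross_entropy_grad(W, X, y)
        U, _, Vt = np.linalg.svd(G, full_matrices=False)
        # Steepest direction in the unit nuclear-norm ball: leading rank-one factor.
        W -= eta0 / np.sqrt(t) * np.outer(U[:, 0], Vt[0])
    return W


X, y = make_separable_data()
W = nsd_nuclear(X, y, k=3)
print("relative margin:", relative_margin(W, X, y))
```

Tracking `relative_margin` over the iterations, rather than only at the end, would be the natural way to compare against a max-margin reference, but computing that reference is outside the scope of this sketch.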

Figures (8)

  • Figure 1: Correlations and spectra of max-margin solutions under different norms
  • Figure 2: Weight heatmaps of max-margin solutions under different norms
  • Figure 3: NucGD experiments: (a) singular value spectra of max-margin solutions under different norms; (b) normalized-margin error along NucGD measured against different max-margin references; (c) correlations between NucGD iterates and different max-margin directions; (d) singular value spectra of final solutions produced by different NSD algorithms.
  • Figure 4: Effect of noise controlled by batch size $B$
  • Figure 5: Effect of momentum weight $\mu$, mini-batch training
  • ...and 3 more figures

Theorems & Definitions (10)

  • Theorem 1
  • Theorem 2
  • proof
  • Remark 1
  • proof
  • Lemma 1
  • proof
  • Theorem 3
  • proof
  • Remark 2