The Implicit Bias of Logit Regularization

Alon Beck, Yohai Bar Sinai, Noam Levi

TL;DR

This work analyzes convex logit regularization, including label smoothing, in linear classifiers and shows that such penalties induce logit clustering around finite targets $z^*$. For Gaussian data, or when per-sample losses are quadratic, the optimal weight direction aligns with Fisher's Linear Discriminant, $\boldsymbol{S}\propto \Sigma^{-1}\boldsymbol{\mu}$, and generalization performance becomes largely insensitive to the exact form of the regularizer. The authors show that the interpolation threshold shifts to $\lambda_c=1$ in noiseless-feature regimes, uncover grokking dynamics under weak regularization, and prove that optimal generalization is invariant to the orthogonal noise scale $\sigma_n$. Empirical validation on Gaussian data and on neural-network penultimate embeddings supports the theory and links soft-target regularization to classical discrimination geometry, highlighting the efficacy of logit-regularization methods beyond label smoothing.

Abstract

Logit regularization, the addition of a convex penalty directly in logit space, is widely used in modern classifiers, with label smoothing as a prominent example. While such methods often improve calibration and generalization, their mechanism remains under-explored. In this work, we analyze a general class of such logit regularizers in the context of linear classification, and demonstrate that they induce an implicit bias of logit clustering around finite per-sample targets. For Gaussian data, or whenever logits are sufficiently clustered, we prove that logit clustering drives the weight vector to align exactly with Fisher's Linear Discriminant. To demonstrate the consequences, we study a simple signal-plus-noise model in which this transition has dramatic effects: logit regularization halves the critical sample complexity and induces grokking in the small-noise limit, while making generalization robust to noise. Our results extend the theoretical understanding of label smoothing and highlight the efficacy of a broader class of logit-regularization methods.
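The alignment claim can be illustrated numerically. The following is a minimal sketch, not the authors' experimental setup: it assumes binary labels $y\in\{\pm 1\}$, a bias-free linear logit $z=\boldsymbol{S}^{T}\boldsymbol{x}$, and a label-smoothed logistic loss with a smoothing parameter `eps` chosen here for illustration (its relation to the paper's $\alpha$ is not specified in this summary). The data dimensions, means, covariance, and helper names are likewise hypothetical.

```python
import numpy as np

def fit_label_smoothed_linear(x_signed, eps, n_steps=3000):
    """Gradient descent on the mean label-smoothed logistic loss.

    x_signed holds y_i * x_i, so the signed logit is u_i = S^T (y_i x_i).
    Per-sample loss: -(1 - eps) * log sigmoid(u) - eps * log sigmoid(-u),
    which has a finite minimum at u = log((1 - eps) / eps).
    """
    n, d = x_signed.shape
    # Per-sample curvature is at most 1/4, so 1/L is a safe step size.
    lipschitz = 0.25 * np.linalg.eigvalsh(x_signed.T @ x_signed / n)[-1]
    s = np.zeros(d)
    for _ in range(n_steps):
        u = x_signed @ s
        sigma = 0.5 * (1.0 + np.tanh(0.5 * u))  # numerically stable sigmoid
        grad = x_signed.T @ (sigma - (1.0 - eps)) / n
        s -= grad / lipschitz
    return s

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical Gaussian data: x = y * mu + noise, with anisotropic noise.
rng = np.random.default_rng(0)
n, d = 20000, 5
mu = np.array([0.8, 0.8, 0.8, 0.0, 0.0])
noise_var = np.array([0.5, 1.0, 2.0, 1.0, 0.5])
y = rng.choice([-1.0, 1.0], size=n)
x = y[:, None] * mu + rng.normal(size=(n, d)) * np.sqrt(noise_var)
x_signed = y[:, None] * x

s_hat = fit_label_smoothed_linear(x_signed, eps=0.1)

# Empirical Fisher/LDA direction: (centered covariance)^{-1} * mean.
mu_hat = x_signed.mean(axis=0)
cov_hat = np.cov(x_signed, rowvar=False)
lda_dir = np.linalg.solve(cov_hat, mu_hat)

cos_lda = cosine(s_hat, lda_dir)   # alignment with Sigma^{-1} mu
cos_mean = cosine(s_hat, mu_hat)   # alignment with the raw mean direction
print(f"cosine with LDA direction:  {cos_lda:.4f}")
print(f"cosine with mean direction: {cos_mean:.4f}")
```

In this sketch the noise is anisotropic on purpose, so that $\Sigma^{-1}\boldsymbol{\mu}$ and $\boldsymbol{\mu}$ point in visibly different directions and the two cosine similarities can be told apart.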

Paper Structure (41 sections, 11 theorems, 59 equations, 20 figures)

Key Result

Proposition 3.1

Let $\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{\mu}_{x}, \Sigma_{x})$ with $\boldsymbol{\mu}_{x} \neq \mathbf{0} \in \mathbb{R}^d$, and consider a vector $\boldsymbol{S}\in\mathbb{R}^d$. The logit $z(\boldsymbol{S}) = \boldsymbol{S}^{T}\boldsymbol{x}$ is Gaussian wi
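For reference, the textbook property of linear maps of Gaussian vectors that a statement of this kind rests on can be written as below. This is a standard identity, not a reconstruction of the truncated proposition text; the symbols $m$ and $v$ are introduced here for illustration only.

```latex
z(\boldsymbol{S}) = \boldsymbol{S}^{T}\boldsymbol{x} \sim \mathcal{N}(m, v),
\qquad
m = \boldsymbol{S}^{T}\boldsymbol{\mu}_{x},
\qquad
v = \boldsymbol{S}^{T}\Sigma_{x}\boldsymbol{S}.
```

Assuming $\Sigma_{x}$ is invertible, minimizing $v$ at a fixed mean logit $m$ gives $\boldsymbol{S}\propto\Sigma_{x}^{-1}\boldsymbol{\mu}_{x}$, which is the Fisher direction quoted in the TL;DR.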

Figures (20)

  • Figure 1: Logit regularization induces clustering. Logit evolution for a linear classifier trained on Gaussian data (see the numerical-details appendix for more on the setup). The train (blue) and test (red) samples are visualized at three training epochs (early, middle, and final). Note that samples are classified correctly if $z>0$. Top ($\alpha=0$): The unregularized logits are pushed toward infinity to maximize margins, indicating overconfidence. Bottom ($\alpha > 0$): The regularized loss exhibits a distinct finite minimum, driving the logits to cluster tightly around a target value $z^*$ during training. Note the looser clustering on the test set due to the misalignment between the empirical noise direction and the true signal.
  • Figure 2: Change in the implicit bias and robustness to $\alpha$. Analysis of the optimal weights $\boldsymbol{S}_{\min}$ on Student's t-distributed data ($\nu \in \{2.1, 3, 20\}$). Left panel: Cosine similarity between $\boldsymbol{S}_{\min}$ and the feature axis for each value of $\nu$. The dashed lines indicate the unregularized values ($\alpha=0$). We observe an abrupt shift from $\alpha=0$ to $\alpha>0$, followed by a plateau that becomes flatter as $\nu$ increases (approaching a Gaussian distribution). For comparison, the black dashed line shows the limiting LDA expected value, $\Sigma^{-1}\mu$, where $\mu$ and $\Sigma$ are the empirical mean and centered covariance. Right panel: The norm $\|\boldsymbol{S}_{\min}\|$, which, in contrast, exhibits a clear dependence on $\alpha$. See the numerical-details appendix for more on the numerical setup.
  • Figure 3: Effect of Input Noise on Feature Geometry. Top Row: Visualization of penultimate-layer features for two classes (Planes vs. Cats) from ResNet-18. The horizontal axis represents the signal direction (connecting class means), while the vertical axis represents the effective orthogonal noise radius. Left: Clean CIFAR-10 data. Right: Noisy CIFAR-10C data. Bottom Row: The eigenspectrum of the orthogonal noise covariance matrix. For more details, see the numerical-details appendix.
  • Figure 4: Perfect generalization with zero feature noise ($\sigma_f=0$). Top: Evolution of the logit distribution across three training epochs. At convergence, all logits collapse to a single target value $z^*$. Bottom: Loss and accuracy over time. Vertical dashed lines indicate the epochs corresponding to the top snapshots.
  • Figure 5: Grokking. Loss and accuracy curves for $\sigma_{f}=0$ and $\lambda=0.7$. While the unregularized model ($\alpha=0$) overfits, logit regularization induces grokking (delayed generalization), with the delay diverging as $\alpha$ approaches zero.
  • ...and 15 more figures

Theorems & Definitions (22)

  • Proposition 3.1: Gaussian Data
  • proof
  • Corollary 3.2: Quadratic Loss for any data
  • proof
  • Corollary 3.3: LDA direction
  • proof
  • Proposition 5.1
  • proof
  • Proposition 6.1
  • proof
  • ...and 12 more