The Malignant Tail: Spectral Segregation of Label Noise in Over-Parameterized Networks

Zice Wang

TL;DR

These findings suggest that under label noise, excess spectral capacity is not harmless redundancy but a latent structural liability that allows for noise memorization, necessitating explicit rank constraints to filter stochastic corruptions for robust generalization.

Abstract

While implicit regularization facilitates benign overfitting in low-noise regimes, recent theoretical work predicts a sharp phase transition to harmful overfitting as the noise-to-signal ratio increases. We experimentally isolate the geometric mechanism of this transition: the Malignant Tail, a failure mode where networks functionally segregate signal and noise, reducing coherent semantic features into low-rank subspaces while pushing stochastic label noise into high-frequency orthogonal components, distinct from systematic or corruption-aligned noise. Through a Spectral Linear Probe of training dynamics, we demonstrate that Stochastic Gradient Descent (SGD) fails to suppress this noise, instead implicitly biasing it toward high-frequency orthogonal subspaces, effectively preserving signal-noise separability. We show that this geometric separation is distinct from simple variance reduction in untrained models. In trained networks, SGD actively segregates noise, allowing post-hoc Explicit Spectral Truncation (d << D) to surgically prune the noise-dominated subspace. This approach recovers the optimal generalization capability latent in the converged model. Unlike unstable temporal early stopping, Geometric Truncation provides a stable post-hoc intervention. Our findings suggest that under label noise, excess spectral capacity is not harmless redundancy but a latent structural liability that allows for noise memorization, necessitating explicit rank constraints to filter stochastic corruptions for robust generalization.
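The post-hoc truncation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes penultimate-layer features have already been extracted into NumPy arrays, and it swaps in a plain least-squares linear probe for whatever probe the paper actually uses. The function names `top_d_basis` and `truncated_probe_accuracy` are hypothetical.

```python
import numpy as np

def top_d_basis(features, d):
    """Return the feature mean and the top-d principal directions as a (D, d) matrix."""
    mean = features.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:d].T

def truncated_probe_accuracy(train_x, train_y, test_x, test_y, d, num_classes):
    """Fit a least-squares linear probe on the top-d PCA coordinates (assumed probe)."""
    mean, basis = top_d_basis(train_x, d)
    # Project both splits onto the subspace estimated from the training features.
    z_train = (train_x - mean) @ basis
    z_test = (test_x - mean) @ basis
    # Append a constant column so the probe has a bias term.
    z_train = np.hstack([z_train, np.ones((len(z_train), 1))])
    z_test = np.hstack([z_test, np.ones((len(z_test), 1))])
    # One-hot targets built from the (possibly noisy) integer training labels.
    targets = np.eye(num_classes)[train_y]
    weights, *_ = np.linalg.lstsq(z_train, targets, rcond=None)
    preds = (z_test @ weights).argmax(axis=1)
    return float((preds == test_y).mean())
```

Sweeping `d` from 1 up to the feature width D and plotting the returned accuracy would give a curve of the kind the abstract describes, under the assumption that the noisy labels are used for fitting and clean labels for evaluation.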

Paper Structure (70 sections, 3 theorems, 20 equations, 16 figures)

Key Result

Theorem 3.3

Let $\mathcal{E}(d)$ be the excess risk of the minimum-norm linear interpolator (ridgeless limit) constrained to the subspace spanned by the top-$d$ principal components. Under the spectral separation assumption, the error decomposes into two terms, $\mathcal{B}_d$ and $\mathcal{V}_d$. The term $\mathcal{B}_d$ decays according to the power-law spectral decay of the signal manifold. Conversely, the term $\mathcal{V}_d$ grows linearly with $d$ once the ...
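The object in the theorem, a minimum-norm least-squares fit restricted to the top-$d$ principal directions, can be explored numerically with a toy simulation. The sketch below is an assumption-laden illustration rather than the paper's experiment: it uses a spiked diagonal covariance with `k_star` strong directions, a unit-norm target supported on those directions, and Gaussian label noise. The function name `excess_risk_curve` and all default values are hypothetical.

```python
import numpy as np

def excess_risk_curve(n=100, dim=200, k_star=10, spike=25.0, noise_std=2.0, seed=0):
    """Exact excess risk of the top-d constrained min-norm fit, for d = 1..min(n, dim)."""
    rng = np.random.default_rng(seed)
    # Spiked covariance (assumed): k_star directions with variance `spike`, the rest with variance 1.
    scales = np.ones(dim)
    scales[:k_star] = np.sqrt(spike)
    # Unit-norm target lying entirely in the spiked (signal) directions.
    w_true = np.zeros(dim)
    w_true[:k_star] = 1.0 / np.sqrt(k_star)
    x = rng.normal(size=(n, dim)) * scales
    y = x @ w_true + noise_std * rng.normal(size=n)
    # Uncentered PCA via SVD; the data are zero-mean by construction.
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    cov_diag = scales ** 2
    risks = []
    for d in range(1, min(n, dim) + 1):
        # Least-squares solution restricted to the top-d right singular vectors.
        w_hat = vt[:d].T @ ((u[:, :d].T @ y) / s[:d])
        err = w_hat - w_true
        # Population excess risk err^T Sigma err, with Sigma diagonal.
        risks.append(float(err @ (cov_diag * err)))
    return np.array(risks)
```

Plotting the returned array against `d` on a log scale is one way to look for the pattern the theorem describes: a drop as the signal directions are included, followed by growth as noise-dominated directions enter the fit.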

Figures (16)

  • Figure 1: The Geometry of Robustness. (Left) Standard training allows the representation to expand into high-frequency dimensions to fit noise (The Malignant Tail). (Right) Aggressive compression collapses distinct semantic classes. (Center) The Optimal Spectral Efficiency zone aligns the representation rank with the data's intrinsic dimension, filtering noise while preserving semantics.
  • Figure 2: Geometry of the Malignant Tail. Analytical heatmap of Log Test Error ($\log R(d)$) under the Spiked Covariance model ($k^*=10$). The horizontal blue valley at $d \approx k^*$ represents the safe subspace. The top-left quadrant ($d \gg N$) illustrates the failure of over-parameterization. Our post-hoc truncation forces the model back into the blue valley.
  • Figure 3: Universality of Spectral Failure. Generalization error vs. Dimension ($d$) for Linear Regression and ReLU MLP. Both models achieve optimal risk at $d=k^*$ and degrade identically as $d$ increases, confirming that non-linear architectures are equally susceptible to the Malignant Tail phenomenon.
  • Figure 4: The Geometric Fingerprint of the Malignant Tail. (a) Validation accuracy on ResNet-18 (CIFAR-100, 20% Noise) peaks at the intrinsic dimension ($d \approx 51$) before degrading as the probe enters the spectral tail. (b) Our Dual-Manifold Probe (Procrustes alignment with a Clean Oracle) confirms the cause: while leading eigenvectors align with the clean signal ($\rho \approx 1$), the tail components responsible for the accuracy drop are functionally orthogonal to the true semantic manifold.
  • Figure 5: Visualization of Subspace Semantics. Projections of validation data onto the principal components of a ResNet-18 (CIFAR-10, 20% Noise). (Left) Signal Subspace (PC 1-2): Captures semantic class separation. (Right) Noise Subspace (PC 60-61): The dimensions recruited during the "Malignant" phase exhibit isotropic clustering, confirming they store minimal semantic information.
  • ...and 11 more figures

Theorems & Definitions (11)

  • Definition 3.1: Effective Rank via Spectral Entropy
  • Theorem 3.3: Intrinsic Rank-Risk Convexity
  • proof
  • Proposition 3.4: Geometric Optimality of Truncation
  • proof
  • Remark 3.5: Relation to Neural Collapse
  • Proposition 4.1: Critical Stopping Time
  • Remark 4.2: Extension to Adaptive Methods
  • proof
  • proof
  • ...and 1 more
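Definition 3.1 is listed here by title only. As a rough sketch, one common way to define effective rank via spectral entropy is the exponential of the Shannon entropy of the normalized covariance eigenvalues. The code below assumes that definition; the paper's exact normalization may differ, and the function name `effective_rank` is hypothetical.

```python
import numpy as np

def effective_rank(features, eps=1e-12):
    """exp(Shannon entropy) of the normalized covariance spectrum (assumed definition)."""
    centered = features - features.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    # Squared singular values are proportional to the covariance eigenvalues.
    eigenvalues = singular_values ** 2
    p = eigenvalues / (eigenvalues.sum() + eps)
    # Drop numerically zero entries so that 0 * log(0) is treated as 0.
    p = p[p > eps]
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))
```

Under this definition the value lies between 1 (all variance in one direction) and the feature dimension (a perfectly flat spectrum), which makes it a convenient scalar to compare against a truncation rank $d$.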