Stable Minima of ReLU Neural Networks Suffer from the Curse of Dimensionality: The Neural Shattering Phenomenon

Tongtong Liang, Dan Qiao, Yu-Xiang Wang, Rahul Parhi

TL;DR

This paper addresses why flat minima of two-layer ReLU networks fail to generalize in high dimensions despite gradient descent stability. It develops a data-dependent weighted variation framework, $\mathrm{V}_g$, and a Radon-domain characterization to connect stability to function-space regularity, then derives upper and lower bounds for generalization and nonparametric MSE in the high-dimensional, non-interpolation regime. A key contribution is a novel ReLU-specific minimax lower bound based on boundary-localized neurons, which formalizes the neural shattering phenomenon and the curse of dimensionality for stable minima. Experiments on synthetic data corroborate the theory, showing that high learning rates yield sparse activation and poor generalization in high dimensions, while weight decay improves activation coverage and performance. Overall, the work provides a principled explanation for the limitations of flat-minima biases in high-dimensional neural networks and highlights fundamental trade-offs between stability, regularity, and extrapolation.

Abstract

We study the implicit bias of flatness / low (loss) curvature and its effects on generalization in two-layer overparameterized ReLU networks with multivariate inputs, a problem well motivated by the minima stability and edge-of-stability phenomena in gradient-descent training. Existing work either requires interpolation or focuses only on univariate inputs. This paper presents new and somewhat surprising theoretical results for multivariate inputs. In two natural settings, (1) the generalization gap for flat solutions and (2) the mean-squared error (MSE) in nonparametric function estimation by stable minima, we prove upper and lower bounds, which establish that while flatness does imply generalization, the resulting rates of convergence necessarily deteriorate exponentially as the input dimension grows. This gives an exponential separation between flat solutions and low-norm solutions (i.e., weight decay), which are known not to suffer from the curse of dimensionality. In particular, our minimax lower bound construction, based on a novel packing argument with boundary-localized ReLU neurons, reveals how flat solutions can exploit a kind of "neural shattering" where neurons rarely activate, but with high weight magnitudes. This leads to poor performance in high dimensions. We corroborate these theoretical findings with extensive numerical simulations. To the best of our knowledge, our analysis provides the first systematic explanation for why flat minima may fail to generalize in high dimensions.

Paper Structure

This paper contains 57 sections, 35 theorems, 235 equations, 17 figures.

Key Result

Proposition 2.1

Suppose that $\eta < 2$. A minimum ${\bm{\theta}}^\star$ is linearly stable …

Footnote: In particular, this holds for the definition of linear stability where $\mu({\bm{\theta}}^\star) \leq 0$ in the notation of [chemnitz2025characterizing], which is a strictly weaker notion of linear stability than that of [wu2018].
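As background for the notion of linear stability (this is not a restatement of the proposition, whose full statement is not reproduced here), the following minimal sketch shows the standard one-dimensional picture. Gradient descent with step size $\eta$ on the quadratic $L(\theta) = \tfrac{1}{2}\lambda\theta^2$ multiplies $\theta$ by $(1 - \eta\lambda)$ at every step, so the iterates contract only when $\lambda \leq 2/\eta$. The function name `gd_on_quadratic` and all numerical values are illustrative assumptions, not taken from the paper.

```python
def gd_on_quadratic(curvature, lr, steps=50, theta0=1.0):
    """Run gradient descent on L(theta) = 0.5 * curvature * theta**2."""
    theta = theta0
    for _ in range(steps):
        # The gradient of the quadratic is curvature * theta, so each
        # step multiplies theta by (1 - lr * curvature).
        theta -= lr * curvature * theta
    return theta


lr = 0.5                 # illustrative step size (eta), an assumption
threshold = 2.0 / lr     # curvature above this value makes the iterates grow

# Slightly flatter than the threshold: the iterates shrink toward the minimum.
flat = gd_on_quadratic(curvature=0.9 * threshold, lr=lr)
# Slightly sharper than the threshold: the iterates blow up.
sharp = gd_on_quadratic(curvature=1.1 * threshold, lr=lr)

print(f"below threshold: |theta| = {abs(flat):.3e}")
print(f"above threshold: |theta| = {abs(sharp):.3e}")
```

In this toy model, a large step size only tolerates minima whose curvature is small, which is the sense in which stability under gradient descent acts as a flatness constraint.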

Figures (17)

  • Figure 1: The "neural shattering" phenomenon: From empirical observations to its geometric origin and theoretical consequences. Left panel: Training with a large learning rate and gradient descent empirically results in "neural shattering": Neurons develop large weights despite activating on very few inputs, leading to a high MSE of $\approx 1.105$ (red points). In contrast, explicit $\ell^2$-regularization prevents this, achieving a much lower MSE of $\approx 0.055$ (orange points). Middle panel: The number of distinct directions, or "caps", on a high-dimensional sphere grows exponentially with the dimension. Consequently, the data sites are spread thinly across these caps. This makes it trivial for a ReLU neuron to find a direction that isolates only a few data points. This sparse activation pattern allows neurons to use large weight magnitudes for this local fitting without impacting the global loss curvature, thus "tricking" the flatness criterion. Right panel: Visualization of a "hard-to-learn" function from our minimax lower bound construction, built from the localized ReLU neurons described in the middle panel.
  • Figure 2: Empirical validation of the curse of dimensionality. Left panel: The slope of $\log\mathrm{MSE}$ versus $\log n$ for training with vanilla gradient descent rapidly decreases with dimension, falling to about 0.1 at $d=5$. Right panel: Training with $\ell^2$-regularization (weight decay) results in slopes above $0.5$ in the log–log scale.
  • Figure 3: The top‐left plot illustrates the neural shattering phenomenon: after large‐step training each ReLU neuron (orange) is active on only a tiny fraction of the data (small horizontal support) yet its weight norm remains large, exactly as in our sphere‐packing lower‐bound construction where each outward‐facing ReLU atom fires on very few inputs but retains full peak amplitude.
  • Figure 4: Comparison across input dimension $d$ for a two-layer ReLU network of width 1024 trained on 512 samples for 20000 epochs with learning rate $\eta=0.5$. At $d=1$, all neurons extrapolate (0% active), while as $d$ increases the fraction of neurons surviving training rises dramatically (up to 65% at $d=6$). Simultaneously, the training loss monotonically decreases whereas the training MSE increases with $d$, demonstrating that neural shattering under large learning rates may be the key driver of the curse of dimensionality in stable minima.
  • Figure 5: Effect of increasing learning rate $\eta$ on shattering ($\eta\times\text{epochs}=10000$): as $\eta$ grows, the stability/flatness constraint forces an ever larger fraction of neurons to activate only on a small subset of the data (neural shattering). To further decrease the training loss, gradient descent correspondingly increases the weight norms of the remaining active neurons.
  • ...and 12 more figures
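
The geometric claim in the middle panel of Figure 1 can be illustrated with a small Monte Carlo sketch. It assumes, purely for illustration, that data points are drawn uniformly from the unit sphere and that a neuron $\mathrm{ReLU}(\langle w, x\rangle - b)$ with unit-norm $w$ is active exactly when $\langle w, x\rangle > b$. The helper names, the threshold $b = 0.5$, and the sample size are hypothetical choices and are not the paper's experimental setup.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a point uniformly from the unit sphere by normalizing a Gaussian."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def active_fraction(dim, n_points=2000, threshold=0.5, seed=0):
    """Fraction of sampled points on which ReLU(<w, x> - threshold) is nonzero."""
    rng = random.Random(seed)
    w = random_unit_vector(dim, rng)  # neuron direction (illustrative)
    hits = 0
    for _ in range(n_points):
        x = random_unit_vector(dim, rng)
        # The neuron fires only on the spherical cap where <w, x> > threshold.
        if sum(wi * xi for wi, xi in zip(w, x)) > threshold:
            hits += 1
    return hits / n_points


# The same cap captures a rapidly shrinking share of the data as dim grows.
fractions = {d: active_fraction(d) for d in (2, 5, 10, 20)}
for d, frac in fractions.items():
    print(f"d = {d:2d}: fraction of active points = {frac:.4f}")
```

Under these assumptions, a cap of fixed angular width covers a rapidly shrinking share of the sphere as the dimension grows, so a single neuron can be active on very few data points.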

Theorems & Definitions (67)

  • Proposition 2.1
  • Example 3.1
  • Theorem 3.2
  • Corollary 3.3
  • Theorem 3.4
  • Theorem 3.5
  • Theorem 3.6
  • Theorem 3.7
  • Lemma C.1
  • proof
  • ...and 57 more