The Non-Linearity Perturbation Threshold: Width Scaling and Landscape Bifurcations in Deep Learning

Michael Alexander

Abstract

We study how the optimization landscape of a neural network deforms as a non-linear activation is introduced through a smooth homotopy. Working first in an abstract local setting - a smooth one-parameter family of objective functions together with a critical branch that loses non-degeneracy through a simple Hessian kernel - we show via Lyapunov-Schmidt reduction that the local transition is controlled by the classical codimension-one normal forms (transcritical or pitchfork) and that the associated topology change is governed by Morse-theoretic handle attachment. We then move beyond the abstract framework and verify these assumptions for a concrete two-layer architecture. We prove that bilinear overparameterization creates an $(m-1)d$-dimensional Hessian kernel at the linear endpoint, which Tikhonov regularization lifts to a floor $\alpha>0$; the activation homotopy softens this floor, yielding an explicit bifurcation point $\lambda^*\approx\alpha/|\lambda_1'(0)|$. We derive the eigenvalue-softening rate as a functional of activation derivatives and data moments, and prove that the near-pitchfork normal form ($|g_{aa}/g_{aaa}|\ll 1$) is a structural consequence of $\sigma''(0)=0$ for tanh-like activations. The bifurcation point scales as $\lambda^*\propto\alpha m$ with network width $m$, connecting the framework to the NTK regime: at large $m$ the landscape reorganization is pushed past $\lambda=1$ and the linearized picture prevails. The foundational algebraic theorems have been formally verified in the Lean 4 theorem prover, and theoretical predictions computed for widths $m\in\{3,5,10,20,50,100\}$ exhibit quantitative agreement with the abstract framework.
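
The first-order estimate quoted in the abstract can be unpacked as a short worked calculation. This is a sketch under stated assumptions, not the paper's derivation: it takes the regularization floor $\lambda_1(0)=\alpha$ and a negative slope $\lambda_1'(0)<0$ from the abstract, borrows the $1/m$ fit $\lambda_1'(0)=K/m$ from the Figure 5 caption (the value of $K$ is not given here), and drops all terms beyond first order in $\lambda$.

```latex
\begin{align*}
\lambda_1(\lambda) &= \alpha + \lambda_1'(0)\,\lambda + O(\lambda^2),
  \qquad \lambda_1'(0) < 0, \\
\lambda_1(\lambda^*) = 0
  \;&\Longrightarrow\;
  \lambda^* \approx \frac{\alpha}{|\lambda_1'(0)|}, \\
|\lambda_1'(0)| \approx \frac{|K|}{m}
  \;&\Longrightarrow\;
  \lambda^* \approx \frac{\alpha\, m}{|K|}, \\
\lambda^* > 1
  \;&\Longleftrightarrow\;
  m > \frac{|K|}{\alpha}.
\end{align*}
```

Read this way, the last line is the width beyond which the first-order crossing falls outside the homotopy interval $[0,1]$, which is the sense in which the abstract says the reorganization is pushed past $\lambda=1$.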

Paper Structure

This paper contains 40 sections, 13 theorems, 35 equations, 5 figures, 1 table.

Key Result

Lemma 4.1

Under Assumptions ass:smooth-branch and ass:simple-degeneracy, there exist neighborhoods of $(a,\mu)=(0,0)$ and a unique smooth map $w^*(a,\mu)\in\mathcal{R}$ with $w^*(0,0)=0$ such that $P_{\mathcal{R}}\widetilde{F}\bigl(av_0+w^*(a,\mu),\mu\bigr)=0$. Moreover, critical points of $\widetilde{L}$ near [...] where $g(a,\mu):=P_{\mathcal{N}}\widetilde{F}\bigl(av_0+w^*(a,\mu),\mu\bigr)\cdot v_0$, and equival[...]
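
As a standard illustration (not the paper's own computation) of how a scalar reduced function $g$ of this kind produces the two normal forms named in the abstract, one can Taylor-expand it at the origin. The sketch assumes that the trivial branch is preserved, $g(0,\mu)=0$, and that $g_a(0,0)=0$ at the degeneracy. The symbols $g_{a\mu}$, $g_{aa}$, $g_{aaa}$ denote partial derivatives at $(0,0)$, and higher-order terms are dropped.

```latex
\begin{align*}
g(a,\mu) &\approx g_{a\mu}\, a\mu + \tfrac12 g_{aa}\, a^2 + \tfrac16 g_{aaa}\, a^3, \\
\text{nontrivial zeros:}\quad
0 &= g_{a\mu}\,\mu + \tfrac12 g_{aa}\, a + \tfrac16 g_{aaa}\, a^2, \\
\text{transcritical } (g_{aa}\neq 0):\quad
a &\approx -\frac{2\, g_{a\mu}}{g_{aa}}\,\mu, \\
\text{pitchfork } (g_{aa}=0,\ g_{aaa}\neq 0):\quad
a^2 &\approx -\frac{6\, g_{a\mu}}{g_{aaa}}\,\mu.
\end{align*}
```

In this truncation the quadratic and cubic terms balance at $|a| = 3\,|g_{aa}/g_{aaa}|$, so when $|g_{aa}/g_{aaa}|\ll 1$ the transcritical behavior is confined to a small window around $a=0$ and the branches look like a pitchfork outside it.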

Figures (5)

  • Figure 1: Master summary of numerical results. Row 1: Loss landscape contours for the symmetric toy model ($d=1$, $m=2$, $v=[1,1]$) at four values of $\lambda$; the diagonal $w_1=w_2$ is marked in cyan. Row 2, left: Hessian eigenvalues along the diagonal branch, showing the $S_2$-induced zero transverse eigenvalue at $\lambda=0$. Row 2, right: Reduced potential along the anti-diagonal, showing a monotonically deepening double well. Row 3, left: Full Hessian spectrum for the higher-dimensional model ($d=5$, $m=10$), with a genuine eigenvalue crossing at $\lambda^*\approx 0.72$. Row 3, right: Zoomed eigenvalue tracking near $\lambda^*$, confirming a simple kernel and a nonzero crossing speed. Row 4: Reduced potential slices, polynomial fit, and 2D contours in the center manifold before and after bifurcation.
  • Figure 2: Transversality and solution-quality verification for the $d=5$, $m=10$ model. Left: The reduced coefficient $c_2(\mu)$ is linear in $\mu=\lambda-\lambda^*$ with slope $\approx 0.044$, confirming nonzero transversality. Center: Loss along the continued branch; the bifurcation at $\lambda^*\approx 0.72$ is a spectral event, not a loss discontinuity. Right: Gradient norm along the branch, confirming that continuation tracks a genuine critical point.
  • Figure 3: Phase diagram for the bifurcation point. Left: Bifurcation parameter $\lambda^*$ as a function of the symmetry-breaking parameter $\alpha$ in the toy model $v=[1,\alpha]$; circles denote interior crossings, crosses denote boundary degeneracies. Center: Spectral softening curves for selected $\alpha$ values. Right: Loss and minimum eigenvalue for the $d=5$, $m=10$ model.
  • Figure 4: Width-scaling analysis. Top left: Spectral softening curves for $m=3$ to $m=100$; dots mark the zero-crossing $\lambda^*$. Top center: Bifurcation point $\lambda^*$ vs. width $m$, with first-order theoretical prediction and fit $\lambda^*\approx 0.97-2.03/m$. Top right: Log-log scaling of $\lambda_1(0)$ (constant = $\alpha$) and $|\lambda_1'(0)|\sim m^{-1.2}$. Bottom left: $\lambda^*$ vs. $1/m$ showing the linear relationship predicted by Eq. \ref{eq:lam-star-scaling}. Bottom center: Simple-kernel ratio $|\lambda_1/\lambda_2|$ at $\lambda^*$. Bottom right: Theory vs. numerical $\lambda^*$.
  • Figure 5: Verification of explicit constants. Top row: Theory-vs-numerical comparison for $W^*(0)$ (Sylvester solution), eigenvalue trajectory with 1st/2nd-order predictions, and decomposition of $g_{aa}$ and $g_{aaa}$ into Gauss--Newton and residual contributions. Middle row: Width-scaling constant $K$ from fitting $\lambda_1'(0)=K/m$; constancy of $\lambda_1'(0)\cdot m$; $\lambda^*$ theory vs. numerical; and $\lambda_1(0)=\alpha$ verification. Bottom row: Theory-vs-numerical $\lambda_1'(0)$, reduced potential fit, overcritical $\lambda_1(1)$ check, and summary of all explicit constants.
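
The theory-vs-numerical comparisons of $\lambda^*$ described in these captions can be mimicked on a toy problem. The sketch below is not the paper's code or model: `hessian_family`, the matrices `A` and `B`, and the value of `alpha` are hypothetical stand-ins chosen so that the spectrum at $\lambda=0$ sits on a floor $\alpha$ and softens as $\lambda$ grows. It locates the zero crossing of the smallest eigenvalue by bisection and compares it with the first-order prediction $\alpha/|\lambda_1'(0)|$.

```python
import numpy as np


def min_eig(H):
    """Smallest eigenvalue of a symmetric matrix (eigvalsh sorts ascending)."""
    return np.linalg.eigvalsh(H)[0]


def hessian_family(lam, alpha, A, B):
    """Hypothetical toy family H(lam) = alpha*I + lam*A + lam^2*B."""
    n = A.shape[0]
    return alpha * np.eye(n) + lam * A + lam**2 * B


def crossing_bisect(f, lo, hi, tol=1e-12, max_iter=200):
    """Find a sign change of f on [lo, hi] by plain bisection."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo * f_hi > 0:
        raise ValueError("f does not change sign on the bracket")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi, f_hi = mid, f_mid
        else:
            lo, f_lo = mid, f_mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Assumed toy data: a regularization floor alpha and symmetric perturbations.
alpha = 0.1
A = np.array([[-0.5, 0.1, 0.0],
              [0.1, 0.2, 0.0],
              [0.0, 0.0, 0.3]])
B = 0.05 * np.eye(3)

# H(0) = alpha*I is fully degenerate, so the first-order slope of the
# smallest eigenvalue is the smallest eigenvalue of A.
slope = min_eig(A)
lam_first_order = alpha / abs(slope)

# Numerical crossing of the smallest eigenvalue on the interval [0, 1].
lam_numeric = crossing_bisect(
    lambda lam: min_eig(hessian_family(lam, alpha, A, B)), 0.0, 1.0
)

print(f"first-order estimate: {lam_first_order:.4f}")
print(f"numerical crossing:   {lam_numeric:.4f}")
```

In this toy the quadratic term `B` is positive, so the true crossing lies slightly beyond the first-order estimate; the gap between the two numbers is the kind of discrepancy a second-order prediction would be expected to close.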

Theorems & Definitions (35)

  • Lemma 4.1: Lyapunov--Schmidt reduction
  • proof
  • Remark 4.2
  • Theorem 5.2: Transcritical bifurcation
  • proof
  • Remark 5.3: Saddle-node outside the branch-preserving setting
  • Corollary 5.4: $\mathbb Z_2$-equivariant pitchfork bifurcation
  • proof
  • Remark 5.5: Morse index change
  • Theorem 6.1: Local Morse handle attachment
  • ...and 25 more
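
The Morse index change that accompanies a pitchfork can be checked on a two-variable toy potential. This is a minimal sketch under assumptions, not the paper's reduced potential: the function $V(a,w;\mu) = -\tfrac12\mu a^2 + \tfrac14 a^4 + \tfrac12 w^2$ and the helper names below are hypothetical, with $a$ playing the role of the kernel coordinate and $w$ a single stable transverse direction. The Morse index is counted as the number of negative Hessian eigenvalues at each critical point.

```python
import numpy as np


def hessian(a, w, mu):
    """Hessian of the toy potential V = -mu*a^2/2 + a^4/4 + w^2/2."""
    return np.array([[-mu + 3.0 * a**2, 0.0],
                     [0.0, 1.0]])


def critical_points(mu):
    """Critical points of V: the origin, plus (+/-sqrt(mu), 0) when mu > 0."""
    pts = [(0.0, 0.0)]
    if mu > 0:
        r = float(np.sqrt(mu))
        pts += [(r, 0.0), (-r, 0.0)]
    return pts


def morse_index(H, tol=1e-10):
    """Number of strictly negative eigenvalues of a symmetric matrix."""
    return int(np.sum(np.linalg.eigvalsh(H) < -tol))


def index_table(mu):
    """List of (a-coordinate, Morse index) for every critical point."""
    return [(a, morse_index(hessian(a, w, mu))) for a, w in critical_points(mu)]


before = index_table(-0.25)  # before the bifurcation: one minimum
after = index_table(0.25)    # after: a saddle flanked by two minima

print("mu = -0.25:", before)
print("mu = +0.25:", after)
```

Before the crossing the origin is the only critical point and has index 0; afterwards it has index 1 and two index-0 minima appear. The alternating sum of $(-1)^{\text{index}}$ is 1 in both cases, so the local count is consistent across the transition in this toy.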