Ghosts of Softmax: Complex Singularities That Limit Safe Step Sizes in Cross-Entropy

Piyush Sao

Abstract

Optimization analyses for cross-entropy training rely on local Taylor models of the loss to predict whether a proposed step will decrease the objective. These surrogates are reliable only inside the Taylor convergence radius of the true loss along the update direction. That radius is set not by real-line curvature alone but by the nearest complex singularity. For cross-entropy, the softmax partition function $F=\sum_j \exp(z_j)$ has complex zeros ("ghosts of softmax") that induce logarithmic singularities in the loss and cap this radius. To make this geometry usable, we derive closed-form expressions under logit linearization along the proposed update direction. In the binary case, the exact radius is $\rho^*=\sqrt{\delta^2+\pi^2}/\Delta_a$. In the multiclass case, we obtain the lower bound $\rho_a=\pi/\Delta_a$, where $\Delta_a=\max_k a_k-\min_k a_k$ is the spread of directional logit derivatives $a_k=\nabla z_k\cdot v$. This bound costs one Jacobian-vector product and reveals what makes a step fragile: samples that are both near a decision flip and highly sensitive to the proposed direction tighten the radius. The normalized step size $r=\tau/\rho_a$ separates safe from dangerous updates. Across six tested architectures and multiple step directions, no model fails for $r<1$, yet collapse appears once $r\ge 1$. Temperature scaling confirms the mechanism: normalizing by $\rho_a$ shrinks the onset-threshold spread from standard deviation $0.992$ to $0.164$. A controller that enforces $\tau\le\rho_a$ survives learning-rate spikes up to $10{,}000\times$ in our tests, where gradient clipping still collapses. Together, these results identify a geometric constraint on cross-entropy optimization that operates through Taylor convergence rather than Hessian curvature.
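
The bound and the normalized step size from the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation. It assumes a linear model $z = Wx$, so the directional logit derivatives are exactly $a = Vx$ for a parameter direction $V$; for a real network the same vector would come from one Jacobian-vector product. The function names `logit_spread`, `rho_a`, and `capped_step` are hypothetical. Taking the largest per-sample spread over the batch, and treating $\tau$ as the scalar step length along the direction, are assumptions made here.

```python
import math

def matvec(M, x):
    """Plain-Python matrix-vector product."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def logit_spread(V, x):
    """Delta_a = max_k a_k - min_k a_k, with a = V x (exact JVP when z = W x)."""
    a = matvec(V, x)
    return max(a) - min(a)

def rho_a(V, batch):
    """Lower bound pi / Delta_a, using the largest spread in the batch (assumption)."""
    worst = max(logit_spread(V, x) for x in batch)
    return math.inf if worst == 0.0 else math.pi / worst

def capped_step(tau, rho):
    """Shrink a proposed step length tau so that r = tau / rho <= 1."""
    return min(tau, rho)

# Hypothetical 3-class direction and a batch of three 2-d inputs.
V = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
batch = [[1.0, 0.0], [0.5, 0.5], [2.0, 1.0]]

rho = rho_a(V, batch)          # the sample [2, 1] has spread 5, so rho = pi / 5
tau = 3.0                      # proposed step length along V
r = tau / rho                  # normalized step size; r > 1 flags a risky step
safe_tau = capped_step(tau, rho)
print(f"rho_a={rho:.4f}  r={r:.2f}  capped tau={safe_tau:.4f}")
```

In this toy batch the sample with the largest spread sets the radius, which matches the abstract's point that the samples most sensitive to the proposed direction are the ones that tighten the bound.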

Paper Structure (135 sections, 9 theorems, 67 equations, 13 figures, 3 tables)

Key Result

Theorem 2.1

The Taylor series of $f$ around $x_0$ converges for $|x - x_0| < R$, where $R$ is the distance from $x_0$ to the nearest point in the complex plane where $f$ fails to be analytic (a singularity) [Conway, 1978]. The series diverges for $|x - x_0| > R$.
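
A small numerical sketch can make the theorem concrete. It uses the standard textbook function $1/(1+x^2)$, which is not taken from the paper: the function is smooth on the whole real line, yet its singularities at $\pm i$ limit the Taylor radius around $x_0 = 0$ to $R = 1$. The helper names are hypothetical.

```python
def taylor_partial_sum(x, order):
    """Partial sum of the Taylor series of 1/(1 + x^2) at 0: sum_k (-1)^k x^(2k)."""
    return sum((-1) ** k * x ** (2 * k) for k in range(order + 1))

def abs_error(x, order):
    """Absolute gap between the partial sum and the true function value."""
    return abs(taylor_partial_sum(x, order) - 1.0 / (1.0 + x * x))

orders = (2, 4, 8, 16)
inside = [abs_error(0.5, n) for n in orders]    # |x| < R = 1: error shrinks with order
outside = [abs_error(1.5, n) for n in orders]   # |x| > R = 1: error grows with order

for n, e_in, e_out in zip(orders, inside, outside):
    print(f"order={n:2d}  error at 0.5: {e_in:.3e}  error at 1.5: {e_out:.3e}")
```

Nothing on the real axis signals trouble at $x = 1.5$; the divergence comes entirely from the complex singularities, which is the same mechanism the paper attributes to the softmax partition function.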

Figures (13)

  • Figure 1: We train six small architectures to convergence, then take one step at varying ratios $r = \tau/\rho_a$, where $\tau$ is the step distance and $\rho_a$ is our stability bound (defined in the text). Every network keeps its accuracy for $r < 1$; once $r$ crosses 1, collapse appears across all six architectures.
  • Figure 2: Taylor approximations $T_n(x)$ of $f(x) = 1/(x+a)$ around $x_0 = 0$. Inside the convergence radius $R = a$ (green), all orders approximate $f$ well. Beyond $R$ (red), higher-order approximations diverge faster, not slower.
  • Figure 3: Binary cross-entropy's log-partition $\log(1 + e^x)$ has convergence radius $\rho = \pi$ set by the complex zero at $i\pi$ (Euler: $1 + e^{i\pi} = 0$). Shown: derivative $\sigma(x)$ on log scale. Green: Taylor converges ($|x| < \pi$). Red: diverges ($|x| > \pi$).
  • Figure 4: Half-plane geometry behind Lemma 4.5 (Half-Plane Obstruction). If the phases of $V_k = w_k e^{a_k t}$ lie in an open arc of length $< \pi$, then all vectors remain in a common open half-plane, so no positive combination can cancel to zero.
  • Figure 5: Complex-$t$ consequence of Theorem 4.6 (General Lower Bound). The half-plane obstruction excludes zeros from the strip $|\mathrm{Im}(t)| < \pi/\Delta_a$. Since the disk $|t| < \pi/\Delta_a$ lies inside that strip, the minimal zero modulus satisfies $\rho^* \ge \pi/\Delta_a$.
  • ...and 8 more figures
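
The zero geometry described in Figures 3 to 5 can be checked numerically. The sketch below assumes the linearized logits $z_k(t) = z_k + a_k t$ and, for the binary formula, assumes that $\delta$ denotes the logit gap $z_1 - z_2$ (the summary above does not define it). The logits, directions, and helper names are made up for illustration.

```python
import cmath
import math

def partition(t, z, a):
    """F(t) = sum_k exp(z_k + a_k * t) under logit linearization."""
    return sum(cmath.exp(zk + ak * t) for zk, ak in zip(z, a))

def min_abs_on_disk(z, a, radius, n=60):
    """Smallest |F(t)| over a polar grid in the closed disk |t| <= radius."""
    best = math.inf
    for i in range(1, n + 1):
        for j in range(n):
            t = cmath.rect(radius * i / n, 2 * math.pi * j / n)
            best = min(best, abs(partition(t, z, a)))
    return best

# Figure 3: the complex zero of 1 + e^x at x = i*pi.
euler_residual = abs(1 + cmath.exp(1j * math.pi))

# Binary case: a zero sits at t = (-delta + i*pi) / Delta_a,
# so its modulus is sqrt(delta^2 + pi^2) / Delta_a.
z2, a2 = [0.7, -0.4], [1.5, -0.5]
delta, spread2 = z2[0] - z2[1], a2[0] - a2[1]
t_zero = complex(-delta, math.pi) / spread2
binary_residual = abs(partition(t_zero, z2, a2))
binary_radius = math.sqrt(delta ** 2 + math.pi ** 2) / spread2

# Three classes: F(t) = 1 + 2*cosh(t) has its nearest zero at t = 2*pi*i/3,
# which lies outside the disk of radius pi / Delta_a = pi / 2.
z3, a3 = [0.0, 0.0, 0.0], [1.0, 0.0, -1.0]
bound = math.pi / (max(a3) - min(a3))
true_zero = 2j * math.pi / 3
floor_inside = min_abs_on_disk(z3, a3, 0.99 * bound)

print(f"|1 + e^(i pi)| = {euler_residual:.1e}")
print(f"binary: |F(t_zero)| = {binary_residual:.1e}, radius = {binary_radius:.4f}")
print(f"3-class: bound = {bound:.4f}, nearest zero = {abs(true_zero):.4f}, "
      f"min |F| inside disk = {floor_inside:.3f}")
```

In the three-class example the nearest zero is farther away than $\pi/\Delta_a$, which is consistent with $\rho_a$ being a lower bound rather than the exact radius.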

Theorems & Definitions (19)

  • Theorem 2.1: Cauchy--Hadamard
  • Definition 4.1: Exact Convergence Radius
  • Proposition 4.2: Radius from Partition Zeros
  • proof
  • Theorem 4.3: Binary Convergence Radius
  • proof
  • Definition 4.4: Logit-Derivative Spread
  • Lemma 4.5: Half-Plane Obstruction
  • proof
  • Theorem 4.6: General Lower Bound
  • ...and 9 more