Ghosts of Softmax: Complex Singularities That Limit Safe Step Sizes in Cross-Entropy

Piyush Sao

Abstract

Optimization analyses for cross-entropy training rely on local Taylor models of the loss to predict whether a proposed step will decrease the objective. These surrogates are reliable only inside the Taylor convergence radius of the true loss along the update direction. That radius is set not by real-line curvature alone but by the nearest complex singularity. For cross-entropy, the softmax partition function $F=\sum_j \exp(z_j)$ has complex zeros ("ghosts of softmax") that induce logarithmic singularities in the loss and cap this radius. To make this geometry usable, we derive closed-form expressions under logit linearization along the proposed update direction. In the binary case, the exact radius is $\rho^*=\sqrt{\delta^2+\pi^2}/\Delta_a$. In the multiclass case, we obtain the lower bound $\rho_a=\pi/\Delta_a$, where $\Delta_a=\max_k a_k-\min_k a_k$ is the spread of directional logit derivatives $a_k=\nabla z_k\cdot v$. This bound costs one Jacobian-vector product and reveals what makes a step fragile: samples that are both near a decision flip and highly sensitive to the proposed direction tighten the radius. The normalized step size $r=\tau/\rho_a$ separates safe from dangerous updates. Across six tested architectures and multiple step directions, no model fails for $r<1$, yet collapse appears once $r\ge 1$. Temperature scaling confirms the mechanism: normalizing by $\rho_a$ shrinks the onset-threshold spread from standard deviation $0.992$ to $0.164$. A controller that enforces $\tau\le\rho_a$ survives learning-rate spikes up to $10{,}000\times$ in our tests, where gradient clipping still collapses. Together, these results identify a geometric constraint on cross-entropy optimization that operates through Taylor convergence rather than Hessian curvature.
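
As a rough illustration of the quantities named in the abstract, the Python sketch below computes the spread $\Delta_a$, the lower bound $\rho_a=\pi/\Delta_a$, the normalized step $r=\tau/\rho_a$, and a simple cap that enforces $\tau\le\rho_a$. The function names, the numeric values, and the capping rule are assumptions made for illustration and are not the paper's controller; in practice the $a_k$ would come from one Jacobian-vector product. The last lines check the binary formula numerically, assuming $\delta$ denotes the logit gap $z_1-z_2$ (the abstract does not define it) and that the logits are linear in the step parameter $t$.

```python
import cmath
import math


def logit_spread(a):
    """Delta_a = max_k a_k - min_k a_k for directional logit derivatives a_k."""
    return max(a) - min(a)


def rho_lower_bound(a):
    """Multiclass lower bound rho_a = pi / Delta_a (infinite if all a_k agree)."""
    spread = logit_spread(a)
    return math.inf if spread == 0.0 else math.pi / spread


def normalized_step(tau, a):
    """r = tau / rho_a; r < 1 is the regime the abstract reports as safe."""
    return tau / rho_lower_bound(a)


def cap_step(tau, a):
    """Hypothetical controller: shrink the step distance so that tau <= rho_a."""
    return min(tau, rho_lower_bound(a))


def binary_radius(delta, spread):
    """sqrt(delta^2 + pi^2) / Delta_a, with delta assumed to be z_1 - z_2."""
    return math.hypot(delta, math.pi) / abs(spread)


# Made-up directional derivatives for a three-class example.
a = [0.8, -0.4, 0.1]
rho_a = rho_lower_bound(a)
r = normalized_step(2.0, a)

# Binary check: with linearized logits z_k + a_k * t, the partition function
# exp(z_1 + a_1 t) + exp(z_2 + a_2 t) vanishes at t* = (-delta + i*pi) / Delta_a.
z = [2.0, 0.5]
a2 = [0.8, -0.4]
delta = z[0] - z[1]
spread = a2[0] - a2[1]
t_star = complex(-delta, math.pi) / spread
F_at_zero = cmath.exp(z[0] + a2[0] * t_star) + cmath.exp(z[1] + a2[1] * t_star)
```

Under these assumptions `abs(t_star)` agrees with `binary_radius(delta, spread)`, and `F_at_zero` is zero up to rounding, which is the sense in which a complex zero of $F$ sets the radius.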

Paper Structure (135 sections, 9 theorems, 67 equations, 13 figures, 3 tables)

Introduction
Preliminaries
Analytic Functions and Taylor Series
A Simple Example: f(x) = 1/(x+a)
When does this divergence occur?
Implications for neural networks
Key insight.
Optimization and the Convergence Radius
Taylor expansion of the update
The implicit assumption
Inside vs. outside the radius
Curvature versus convergence radius
Directional convergence radius
What remains.
Problem Formulation
...and 120 more sections

Key Result

Theorem 2.1

The Taylor series of $f$ around $x_0$ converges for $|x - x_0| < R$, where $R$ is the distance from $x_0$ to the nearest point where $f$ fails to be analytic (a singularity) (Conway, 1978). The series diverges for $|x - x_0| > R$.
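
A small numerical check of this statement, using the outline's example $f(x)=1/(x+a)$ expanded at $x_0=0$, where the only singularity is at $x=-a$ and so $R=a$. This is a sketch: the helper names and the sample points are illustrative choices, not taken from the paper.

```python
def taylor_partial_sum(x, a, order):
    """T_n(x) = sum_{k=0}^{n} (-1)^k x^k / a^(k+1), the expansion of 1/(x+a) at 0."""
    return sum((-1) ** k * x ** k / a ** (k + 1) for k in range(order + 1))


def taylor_error(x, a, order):
    """Absolute error of the order-n Taylor polynomial against the true value."""
    return abs(taylor_partial_sum(x, a, order) - 1.0 / (x + a))


a = 1.0  # singularity at x = -1, so the convergence radius is R = 1

# Inside the radius (|x| = 0.5 < R): the error shrinks as the order grows.
inside = [taylor_error(0.5, a, n) for n in (5, 10, 20)]

# Outside the radius (|x| = 1.5 > R): the error grows as the order grows.
outside = [taylor_error(1.5, a, n) for n in (5, 10, 20)]
```

For this function the error equals $|x/a|^{n+1}/|x+a|$, so adding terms helps when $|x|<a$ and hurts when $|x|>a$, even though $f$ itself is smooth at $x=1.5$.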

Figures (13)

Figure 1: We train six small architectures to convergence, then take one step at varying ratios $r = \tau/\rho_a$, where $\tau$ is the step distance and $\rho_a$ is our stability bound (defined in the text). Every network keeps its accuracy for $r < 1$; once $r$ crosses 1, collapse appears across all six architectures.
Figure 2: Taylor approximations $T_n(x)$ of $f(x) = 1/(x+a)$ around $x_0 = 0$. Inside the convergence radius $R = a$ (green), all orders approximate $f$ well. Beyond $R$ (red), higher-order approximations diverge faster, not slower.
Figure 3: Binary cross-entropy's log-partition $\log(1 + e^x)$ has convergence radius $\rho = \pi$ set by the complex zero at $i\pi$ (Euler: $1 + e^{i\pi} = 0$). Shown: derivative $\sigma(x)$ on log scale. Green: Taylor converges ($|x| < \pi$). Red: diverges ($|x| > \pi$).
Figure 4: Half-plane geometry behind Lemma 4.5. If the phases of $V_k = w_k e^{a_k t}$ lie in an open arc of length $< \pi$, then all vectors remain in a common open half-plane, so no positive combination can cancel to zero.
Figure 5: Complex-$t$ consequence of Theorem 4.6. The half-plane obstruction excludes zeros from the strip $|\mathrm{Im}(t)| < \pi/\Delta_a$. Since the disk $|t| < \pi/\Delta_a$ lies inside that strip, the minimal zero modulus satisfies $\rho^* \ge \pi/\Delta_a$.
...and 8 more figures
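
The half-plane argument in the captions for Figures 4 and 5 can be checked numerically. The sketch below assumes $F(t)=\sum_k w_k e^{a_k t}$ with positive weights $w_k$, as in the Figure 4 caption; the weights, derivatives, and helper names are made up for illustration. After rotating by the mid-phase $\tfrac{1}{2}(\max_k a_k+\min_k a_k)\,\mathrm{Im}(t)$, every term has positive real part whenever $|\mathrm{Im}(t)|<\pi/\Delta_a$, so $F(t)$ cannot vanish there.

```python
import cmath
import math
import random


def partition(t, w, a):
    """F(t) = sum_k w_k * exp(a_k * t) with positive weights w_k."""
    return sum(wk * cmath.exp(ak * t) for wk, ak in zip(w, a))


def halfplane_witness(t, w, a):
    """Real part of F(t) after rotating by the mid-phase; positive inside the strip."""
    mid = 0.5 * (max(a) + min(a))
    return (cmath.exp(-1j * mid * t.imag) * partition(t, w, a)).real


random.seed(0)
w = [0.7, 0.2, 0.1]    # made-up positive weights
a = [1.3, -0.5, 0.4]   # made-up directional logit derivatives
bound = math.pi / (max(a) - min(a))  # pi / Delta_a

# Sample points in the strip |Im(t)| < pi / Delta_a and record the witness.
witnesses = []
for _ in range(1000):
    t = complex(random.uniform(-3.0, 3.0), random.uniform(-0.999, 0.999) * bound)
    witnesses.append(halfplane_witness(t, w, a))

# The bound is attained in a two-term case: equal weights and spread 1
# give F(i*pi) = 1 + exp(i*pi) = 0, a zero exactly at distance pi / Delta_a.
tight_zero = partition(complex(0.0, math.pi), [1.0, 1.0], [0.0, 1.0])
```

Every sampled witness is strictly positive, consistent with the strip being zero-free, while `tight_zero` vanishes up to rounding, matching the zero at $i\pi$ described in the Figure 3 caption.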

Theorems & Definitions (19)

Theorem 2.1: Cauchy–Hadamard
Definition 4.1: Exact Convergence Radius
Proposition 4.2: Radius from Partition Zeros
proof
Theorem 4.3: Binary Convergence Radius
proof
Definition 4.4: Logit-Derivative Spread
Lemma 4.5: Half-Plane Obstruction
proof
Theorem 4.6: General Lower Bound
...and 9 more
