Saddle-To-Saddle Dynamics in Deep ReLU Networks: Low-Rank Bias in the First Saddle Escape
Ioannis Bantzis, James B. Simon, Arthur Jacot
TL;DR
The paper studies how gradient descent (GD) escapes the saddle at the origin in deep ReLU networks initialized with small weights, introducing escape directions as a building block for Saddle-to-Saddle dynamics. It develops a gradient-flow framework for homogeneous losses and proves that the optimal escape directions exhibit a depth-dependent low-rank bias: deeper layers become nearly rank-1 and more linear, and the optimal escape speed increases with depth. The work also connects these findings to the Bottleneck Rank (BN-rank), showing a half-bottleneck structure in the first escape. MNIST experiments corroborate the low-rank tendency in later layers, and a counterexample shows that exactly rank-1 escape directions are not always optimal. The results are a first step toward a rigorous Saddle-to-Saddle theory for deep ReLU networks and motivate a view of feature learning as incremental learning in BN-rank. Overall, the work links initialization scale, depth, and low-rank structure to the trajectory of GD through a sequence of saddles.
Abstract
When a deep ReLU network is initialized with small weights, gradient descent (GD) is at first dominated by the saddle at the origin in parameter space. We study the so-called escape directions, which play a role similar to that of the eigenvectors of the Hessian for strict saddles. We show that the optimal escape direction features a low-rank bias in its deeper layers: the first singular value of the $\ell$-th layer weight matrix is larger than any other singular value by a factor of at least $\ell^{\frac{1}{4}}$. We also prove a number of related results about these escape directions. We argue that this result is a first step toward proving Saddle-to-Saddle dynamics in deep ReLU networks, where GD visits a sequence of saddles with increasing bottleneck rank.
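
To make the singular-value statement in the abstract concrete, the following is a minimal sketch of how one might measure the gap between the first and second singular values of each layer and compare it with $\ell^{\frac{1}{4}}$. It is not the authors' code: the function names `singular_value_gaps` and `satisfies_bound` are hypothetical, the weight matrices are synthetic (a rank-1 matrix plus small noise), and reading the bound as a ratio $s_1 / s_2 \geq \ell^{\frac{1}{4}}$ with layers indexed from 1 is an assumption. The result in the paper concerns the optimal escape direction, not arbitrary weights.

```python
import numpy as np


def singular_value_gaps(weights):
    """Return s_1 / s_2 for each weight matrix (each needs >= 2 singular values)."""
    gaps = []
    for W in weights:
        # Singular values only, returned in descending order.
        s = np.linalg.svd(W, compute_uv=False)
        gaps.append(s[0] / s[1] if s[1] > 0 else np.inf)
    return gaps


def satisfies_bound(weights):
    """Check s_1 >= ell**0.25 * s_2 layer by layer (assumed reading of the bound)."""
    gaps = singular_value_gaps(weights)
    # Layers are indexed from ell = 1, an assumption made for this sketch.
    return [gap >= ell ** 0.25 for ell, gap in enumerate(gaps, start=1)]


# Synthetic stand-in for an escape direction: rank-1 matrices plus small noise.
rng = np.random.default_rng(0)
weights = []
for ell in range(1, 5):
    u = rng.standard_normal((8, 1))
    v = rng.standard_normal((1, 8))
    weights.append(u @ v + 0.01 * rng.standard_normal((8, 8)))

gaps = singular_value_gaps(weights)
checks = satisfies_bound(weights)
for ell, (gap, ok) in enumerate(zip(gaps, checks), start=1):
    print(f"layer {ell}: s1/s2 = {gap:.1f}, threshold = {ell ** 0.25:.3f}, ok = {ok}")
```

In practice one would replace the synthetic matrices with the per-layer weights of a trained or escaping network and inspect how the ratio changes with the layer index.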
