Deconstructing the Goldilocks Zone of Neural Network Initialization
Artem Vysogorets, Anna Dawid, Julia Kempe
TL;DR
This work analyzes the Goldilocks zone, the region in parameter space with excess positive curvature of the training loss Hessian, within homogeneous neural networks. Using the Gauss-Newton decomposition, it shows that the zone is governed by the G-term rather than by initialization norm alone, and demonstrates scale-invariance via softmax temperature, so that the same curvature behavior can appear across different norms. A fundamental condition, $\lVert \text{G} \rVert_2 \,\text{vs.}\, \lVert \text{H} \rVert_2$, is derived to characterize the presence of excess curvature, with the G-term tied to model confidence and vanishing logits; a new vanishing cross-entropy gradient phenomenon is also uncovered when the data prior and model outputs align. The paper couples analytic results with gradient-descent experiments on scaled homogeneous nets, revealing that high curvature is not a reliable sole predictor of trainability and that dynamics near the zone edge can exhibit surprising behaviors such as zero logits, lazy learning, or divergence. Overall, the work calls for rethinking initialization diagnostics and highlights the nuanced, scale- and data-dependent nature of optimization in deep networks.
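The link between initialization norm and softmax temperature mentioned above can be illustrated with a small numerical check. The sketch below is not the paper's code; it assumes a bias-free ReLU MLP as the homogeneous network, and the names `relu_mlp`, `softmax`, `alpha`, and `depth` are illustrative. For such a network with `depth` weight matrices, scaling every weight by `alpha > 0` multiplies the logits by `alpha**depth`, which is the same as running the unscaled network with softmax temperature `alpha**(-depth)`.

```python
import numpy as np

def relu_mlp(weights, x):
    """Bias-free ReLU MLP (assumed architecture, not taken from the paper).

    Without biases, the output is positively homogeneous of degree
    len(weights) in the parameters.
    """
    h = x
    for W in weights[:-1]:
        h = np.maximum(h @ W, 0.0)  # ReLU commutes with positive scaling
    return h @ weights[-1]          # final linear layer produces logits

def softmax(z, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    z = z / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 4]  # input dim, two hidden layers, 4 classes (arbitrary)
weights = [rng.normal(size=(m, n)) / np.sqrt(m)
           for m, n in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=(5, sizes[0]))

alpha = 0.5             # hypothetical rescaling of the initialization
depth = len(weights)
scaled_weights = [alpha * W for W in weights]

logits = relu_mlp(weights, x)
logits_scaled = relu_mlp(scaled_weights, x)

# Check 1: logits scale as alpha**depth.
homogeneous = np.allclose(logits_scaled, alpha**depth * logits)

# Check 2: rescaled weights at temperature 1 match the original weights
# at temperature alpha**(-depth).
same_probs = np.allclose(
    softmax(logits_scaled),
    softmax(logits, temperature=alpha ** (-depth)),
)
print(homogeneous, same_probs)
```

Because the predicted probabilities depend only on the ratio of logit scale to temperature, a change in initialization norm can be offset by a matching change in temperature. This is one way to read the claim that the same curvature behavior can appear across different norms.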
Abstract
The second-order properties of the training loss have a massive impact on the optimization dynamics of deep learning models. Fort & Scherlis (2019) discovered that a large excess of positive curvature and local convexity of the loss Hessian is associated with highly trainable initial points located in a region coined the "Goldilocks zone". Only a handful of subsequent studies touched upon this relationship, so it remains largely unexplained. In this paper, we present a rigorous and comprehensive analysis of the Goldilocks zone for homogeneous neural networks. In particular, we derive the fundamental condition resulting in excess of positive curvature of the loss, explaining and refining its conventionally accepted connection to the initialization norm. Further, we relate the excess of positive curvature to model confidence, low initial loss, and a previously unknown type of vanishing cross-entropy loss gradient. To understand the importance of excessive positive curvature for trainability of deep networks, we optimize fully-connected and convolutional architectures outside the Goldilocks zone and analyze the emergent behaviors. We find that strong model performance is not perfectly aligned with the Goldilocks zone, calling for further research into this relationship.
