Deconstructing the Goldilocks Zone of Neural Network Initialization

Artem Vysogorets, Anna Dawid, Julia Kempe

TL;DR

This work analyzes the Goldilocks zone, the region in parameter space with excess positive curvature of the training loss Hessian, within homogeneous neural networks. Using the Gauss-Newton decomposition, it shows that the zone is governed by the G-term rather than initialization norm alone, and demonstrates scale-invariance via softmax temperature so that the same curvature behavior can appear across different norms. A fundamental condition, $\lVert\text{G}\rVert_2 \,\text{vs.}\, \lVert\text{H}\rVert_2$, is derived to characterize the presence of excess curvature, with the G-term tied to model confidence and vanishing logits; a new vanishing cross-entropy gradient phenomenon is also uncovered when the data prior and model outputs align. The paper couples analytic results with gradient-descent experiments on scaled homogeneous nets, revealing that high curvature is not a reliable sole predictor of trainability and that dynamics near the zone edge can exhibit surprising behaviors such as zero logits, lazy learning, or divergence. Overall, the work calls for rethinking initialization diagnostics and highlights the nuanced, scale- and data-dependent nature of optimization in deep networks.
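
The interplay between initialization scale and softmax temperature can be illustrated with a small sketch. It assumes a bias-free ReLU multilayer perceptron, which is positively homogeneous of degree $L$ in its weights (with $L$ the number of layers), so scaling every weight by $\alpha>0$ multiplies the logits by $\alpha^L$; this has the same effect on the softmax output as keeping the weights unscaled and using temperature $T=\alpha^{-L}$. The layer sizes, the helper names (`forward`, `random_weights`), and the value of `alpha` are illustrative choices, not taken from the paper.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def forward(weights, x):
    """Bias-free ReLU MLP (hypothetical toy model): ReLU after all but the last layer."""
    h = x
    for W in weights[:-1]:
        h = relu(matvec(W, h))
    return matvec(weights[-1], h)

def softmax(z, T=1.0):
    """Softmax with temperature T; the max is subtracted for numerical stability."""
    m = max(z)
    e = [math.exp((v - m) / T) for v in z]
    s = sum(e)
    return [v / s for v in e]

def random_weights(sizes, rng):
    """One Gaussian weight matrix per layer (illustrative initialization)."""
    return [
        [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
         for _ in range(n_out)]
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

rng = random.Random(0)
weights = random_weights([8, 16, 16, 4], rng)  # assumed layer sizes
x = [rng.gauss(0.0, 1.0) for _ in range(8)]

L = len(weights)   # degree of homogeneity of the bias-free network
alpha = 0.5        # assumed initialization scale

scaled = [[[alpha * w for w in row] for row in W] for W in weights]

z = forward(weights, x)         # logits at the unscaled initialization
z_scaled = forward(scaled, x)   # logits at the scaled initialization

p_scaled = softmax(z_scaled)            # scaled weights, T = 1
p_temp = softmax(z, T=alpha ** (-L))    # unscaled weights, T = alpha^(-L)

print("max |z_scaled - alpha^L z| =",
      max(abs(a - alpha ** L * b) for a, b in zip(z_scaled, z)))
print("max |p_scaled - p_temp|    =",
      max(abs(a - b) for a, b in zip(p_scaled, p_temp)))
```

Both printed differences should be at the level of floating-point round-off. The sketch only covers the forward pass; it does not compute curvature.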

Abstract

The second-order properties of the training loss have a massive impact on the optimization dynamics of deep learning models. Fort & Scherlis (2019) discovered that a large excess of positive curvature and local convexity of the loss Hessian is associated with highly trainable initial points located in a region coined the "Goldilocks zone". Only a handful of subsequent studies touched upon this relationship, so it remains largely unexplained. In this paper, we present a rigorous and comprehensive analysis of the Goldilocks zone for homogeneous neural networks. In particular, we derive the fundamental condition resulting in excess of positive curvature of the loss, explaining and refining its conventionally accepted connection to the initialization norm. Further, we relate the excess of positive curvature to model confidence, low initial loss, and a previously unknown type of vanishing cross-entropy loss gradient. To understand the importance of excessive positive curvature for trainability of deep networks, we optimize fully-connected and convolutional architectures outside the Goldilocks zone and analyze the emergent behaviors. We find that strong model performance is not perfectly aligned with the Goldilocks zone, calling for further research into this relationship.
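
The connection to model confidence can be made concrete with the matrix $\text{diag}(p)-pp^{\top}$, the Hessian of the cross-entropy loss with respect to the logits, which also appears in the caption of Figure 4. The sketch below scales one fixed set of logits by $\alpha$ and reports the maximum softmax probability alongside the trace of this matrix, $1-\lVert p\rVert_2^2$. The logit values and the use of the trace as a summary statistic are assumptions made for illustration; the paper's own quantities (such as $\Gamma_p$) are not reproduced here.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logit_hessian(p):
    """diag(p) - p p^T: Hessian of cross-entropy with respect to the logits."""
    K = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(K)]
            for i in range(K)]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

def confidence_and_trace(logits, alpha):
    """Scale the logits by alpha; return (max softmax probability, trace of the Hessian)."""
    p = softmax([alpha * v for v in logits])
    return max(p), trace(logit_hessian(p))

logits = [1.3, -0.4, 0.7, 2.1, -1.5]  # arbitrary illustrative values
K = len(logits)

for alpha in (1e-2, 1.0, 1e2):
    conf, tr = confidence_and_trace(logits, alpha)
    print(f"alpha={alpha:g}  max p={conf:.4f}  trace={tr:.6f}")
```

For small $\alpha$ the softmax output approaches the uniform distribution and the trace approaches $1-1/K$; for large $\alpha$ the output approaches a one-hot vector and the matrix vanishes.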

Paper Structure (31 sections, 13 equations, 13 figures)

Figures (13)

  • Figure 1: The Goldilocks zone is an area of excess of positive curvature of the loss. Left: Originally, the Goldilocks zone is observed for a narrow range of initialization scales. Middle: Setting the appropriate softmax temperature $T$ allows for excess positive curvature at initialization of any norm. Right: Recreating the Goldilocks zone at an unscaled initialization just by varying $T$.
  • Figure 2: Positive curvature (top) and spectral norm (bottom) of the Hessian, G-term, and H-term across initialization scales. We computed these quantities on a low-rank subspace with $d=50$. Left: LeNet-300-100 (fully-connected) on FashionMNIST; Right: LeNet-5 (convolutional) on CIFAR-10.
  • Figure 3: Excess of positive eigenvalues of the true G-term $\mathcal{G}$ and the expected G-term $\mathbb{E}\mathcal{G}$ from \ref{Eq:trace-over-norm} computed using logits associated with different initialization scales (model confidence). We used a low-rank subspace with $d=50$. Error bands represent min/max across $3$ seeds. Left: LeNet-300-100 on FashionMNIST; Right: LeNet-5 on CIFAR-10.
  • Figure 4: Left: The dependence of excess of positive eigenvalues on the dimension $d$ of the low-rank hyperplane used to compute the projected G-term. Right: $\Gamma_p^{-2}$ of the matrix $\text{diag}(p)-pp^{\top}$ where vectors $p$ are produced by scaling $30$ different logit sets by $\alpha\in[10^{-2}, 10^2]$ ($30$ pink curves). The red curve corresponds to $\Gamma_p^{-2}$ computed for the average matrix $\frac{1}{30}\sum_{\mu=1}^{30}\left(\text{diag}(p^{\mu})-p^{\mu}{p^{\mu}}^{\top}\right)$.
  • Figure 5: The effects of the average softmax output $\hat{Q}$ and the actual target prior $Q$ on the batch gradient. (a): Cosine similarity of gradients computed on different datasets (SVHN, CIFAR-10, and a randomly generated dataset) at the same initialization of LeNet-5 (downscaled by $\alpha=0.01$ when $\hat{Q}=\text{uniform}$). To achieve $Q\ne \text{uniform}$, we inject artificial class imbalance by subsampling datasets using a procedure suggested by cui. (b): Given a fixed unscaled initialization of LeNet-5, we sample $2{,}000$ different class priors $Q$ uniformly from a probability simplex $\Delta_{10}$. For each of them, we sample a subset of CIFAR-10 with $5{,}000$ samples, compute the cross-entropy gradient, and plot its norm against $\lVert\hat{Q}-Q\rVert_2$. As we predicted, the norms follow a linear trend with the predicted slope ($\sigma_c^2$ was estimated on the entire CIFAR-10 dataset; $d=P=61{,}170$).
  • ...and 8 more figures