Visualizing high-dimensional loss landscapes with Hessian directions

Lucas Böttcher; Gregory Wheeler

TL;DR

The paper addresses the problem of visualizing high-dimensional neural network loss landscapes and shows that random low-dimensional projections often misrepresent saddle points because they distort curvature. It establishes a theoretical link between the curvature observed in random projections and the Hessian of the original loss: the mean projected curvature $\bar{\kappa}^{\alpha,\beta}$ equals the Hessian trace $\mathrm{tr}(H_\theta)$, which enables Hutchinson-type Hessian-trace estimates without explicit Hessian-vector products. The authors propose projecting along dominant Hessian directions (those associated with the largest and smallest principal curvatures) to reveal saddle structure faithfully, and they compare these projections with random projections on neural networks with up to about $7\times 10^6$ parameters. The work provides a principled framework for curvature-based landscape analysis that informs the flatness-generalization discussion and offers practical tools for curvature estimation and visualization, supported by public code.
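The trace identity stated above can be checked in one line. The derivation below is a sketch under the assumption that the random direction $\eta$ has independent entries with zero mean and unit variance (the Figure 3 caption mentions standard-normal entries); a different normalization of $\eta$ would rescale the result, which is consistent with the abstract's reference to a normalized Hessian trace.

$$
\mathbb{E}\left[\eta^\top H_\theta\, \eta\right]
= \sum_{i,j} (H_\theta)_{ij}\, \mathbb{E}\left[\eta_i \eta_j\right]
= \sum_{i} (H_\theta)_{ii}
= \mathrm{tr}(H_\theta),
$$

where the second step uses $\mathbb{E}[\eta_i\eta_j]=1$ for $i=j$ and $0$ otherwise.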

Abstract

Analyzing geometric properties of high-dimensional loss functions, such as local curvature and the existence of other optima around a certain point in loss space, can help provide a better understanding of the interplay between neural network structure, implementation attributes, and learning performance. In this work, we combine concepts from high-dimensional probability and differential geometry to study how curvature properties in lower-dimensional loss representations depend on those in the original loss space. We show that saddle points in the original space are rarely correctly identified as such in expected lower-dimensional representations if random projections are used. The principal curvature in the expected lower-dimensional representation is proportional to the mean curvature in the original loss space. Hence, the mean curvature in the original loss space determines if saddle points appear, on average, as either minima, maxima, or almost flat regions. We use the connection between expected curvature in random projections and mean curvature in the original space (i.e., the normalized Hessian trace) to compute Hutchinson-type trace estimates without calculating Hessian-vector products as in the original Hutchinson method. Because random projections are not suitable to correctly identify saddle information, we propose to study projections along dominant Hessian directions that are associated with the largest and smallest principal curvatures. We connect our findings to the ongoing debate on loss landscape flatness and generalizability. Finally, for different common image classifiers and a function approximator, we show and compare random and Hessian projections of loss landscapes with up to about $7\times 10^6$ parameters.
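The contrast between random and Hessian-direction projections can be illustrated on a toy problem. The following is a minimal sketch, not the authors' code: it assumes a small dense Hessian with a prescribed spectrum (the helper `toy_saddle_hessian`, the spectra, and all function names are hypothetical), Gaussian random directions, and `np.linalg.eigh` to obtain the directions with the largest and smallest eigenvalues.

```python
import numpy as np


def toy_saddle_hessian(num_pos, num_neg, rng):
    """Hypothetical dense symmetric Hessian with num_pos positive and
    num_neg negative eigenvalues (magnitudes spread over [0.5, 1.5])."""
    n = num_pos + num_neg
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal basis
    eigvals = np.concatenate(
        [np.linspace(0.5, 1.5, num_pos), -np.linspace(0.5, 1.5, num_neg)]
    )
    return (q * eigvals) @ q.T  # Q diag(eigvals) Q^T


def projected_hessian(hessian, eta, delta):
    """2x2 Hessian of (alpha, beta) -> L(theta* + alpha*eta + beta*delta)."""
    directions = np.stack([eta, delta], axis=1)  # shape (N, 2)
    return directions.T @ hessian @ directions


def is_saddle(h2):
    """True if the two principal curvatures have opposite signs."""
    kappa_minus, kappa_plus = np.linalg.eigvalsh(h2)  # ascending order
    return bool(kappa_minus < 0.0 < kappa_plus)


def random_saddle_fraction(hessian, num_projections, rng):
    """Fraction of Gaussian random projections in which the saddle
    still looks like a saddle."""
    n = hessian.shape[0]
    hits = 0
    for _ in range(num_projections):
        eta = rng.standard_normal(n)
        delta = rng.standard_normal(n)
        hits += int(is_saddle(projected_hessian(hessian, eta, delta)))
    return hits / num_projections


def hessian_directions(hessian):
    """Eigenvectors for the largest and smallest Hessian eigenvalues."""
    _, eigvecs = np.linalg.eigh(hessian)  # eigenvalues in ascending order
    return eigvecs[:, -1], eigvecs[:, 0]


rng = np.random.default_rng(0)
h_sym = toy_saddle_hessian(100, 100, rng)   # trace 0
h_asym = toy_saddle_hessian(160, 40, rng)   # excess of positive curvature

frac_sym = random_saddle_fraction(h_sym, 500, rng)
frac_asym = random_saddle_fraction(h_asym, 500, rng)

eta_h, delta_h = hessian_directions(h_asym)
h2_hessian = projected_hessian(h_asym, eta_h, delta_h)

print("saddle fraction, random projections (trace 0):", frac_sym)
print("saddle fraction, random projections (positive trace):", frac_asym)
print("saddle visible along Hessian directions:", is_saddle(h2_hessian))
```

In this toy setting, the random projections of the positive-trace Hessian almost never show a saddle, while the projection along the two extreme eigenvectors always does. A full eigendecomposition is only practical for small matrices; for networks with millions of parameters an iterative eigensolver built on Hessian-vector products would be the usual substitute, which is an assumption here rather than something stated in this summary.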

Paper Structure (16 sections, 41 equations, 9 figures, 2 tables, 1 algorithm)

Introduction
Principal curvature in random projections
Differential and information geometry concepts
Random projections
Principal curvature
Hessian trace estimates
Illustrative examples
Extracting curvature information
Hessian trace
Hessian directions
Applications to neural networks
Discussion and conclusion
Concentration inequality
Annihilation method
Image classification
...and 1 more section

Figures (9)

Figure 1: Convergence of the ensemble mean (Eq. \ref{eq:ensemble_average}) of Hessian elements and curvature measures as a function of the number of random projections $S$. (a,c) The deviation of the ensemble means $\langle (H_{\alpha,\beta})_{ij}\rangle$ (${i,j\in\{1,2\}}$) of Hessian elements from the corresponding expected values as a function of $S$. Notice that the expected value of the diagonal elements $(H_{\alpha,\beta})_{11}$ and $(H_{\alpha,\beta})_{22}$ is equal to $\bar{\kappa}^{\alpha,\beta}$ (i.e., to the sum of principal curvatures in the original space) [see Eqs. \ref{eq:expected_hessian} and \ref{eq:trace_estimate}]. A relatively large number of random projections between $10^3$ and $10^4$ is required to keep the deviations at values smaller than about 2--4. (b,d) The ensemble means $\langle \kappa^{\alpha,\beta}_{\pm}\rangle$ [see Eq. \ref{eq:kapp_alpha_beta}] and $\langle \tilde{\kappa}^{\alpha,\beta}_{\pm}\rangle$ [see Eq. \ref{eq:kappa_tilde_sample}] as a function of $S$. Dashed grey lines represent $\bar{\kappa}^{\alpha,\beta}=\mathrm{tr}(H_\theta)$. In panels (a,b) and (c,d), the $N$-dimensional loss functions are given by Eqs. \ref{eq:loss_symmetric} and \ref{eq:loss_asymmetric}, respectively. We evaluate the corresponding Hessians (Eqs. \ref{eq:emp_hessian_1} and \ref{eq:emp_hessian_2}) at the saddle point $\theta^*=(\theta^*_1,\dots,\theta^*_{2n},\theta^*_{2n+1})=(0,\dots,0,1)$. In both loss functions, we set $n=500$, and in the loss function of Eq. \ref{eq:loss_asymmetric} we set $\tilde{n}=800$.
Figure 2: Distribution of principal curvatures $\kappa_{-}^{\alpha,\beta}$ (red bars) and $\kappa_{+}^{\alpha,\beta}$ (black bars). In panels (a) and (b), the loss functions are given by Eqs. \ref{eq:loss_symmetric} and \ref{eq:loss_asymmetric}, respectively. We evaluate the corresponding Hessians (Eqs. \ref{eq:emp_hessian_1} and \ref{eq:emp_hessian_2}) at the saddle point $\theta^*=(\theta^*_1,\dots,\theta^*_{2n},\theta^*_{2n+1})=(0,\dots,0,1)$. In both loss functions, we set $n=500$, and in the loss function of Eq. \ref{eq:loss_asymmetric} we set $\tilde{n}=800$. While in panel (a), the probability $\Pr(\kappa_+^{\alpha,\beta}\kappa_-^{\alpha,\beta}>0)$ that the critical point in the lower-dimensional, random projection does not appear as a saddle is about 0.3, it is 1 in panel (b). Histograms are based on 10,000 random projections that are used to compute $\kappa_{\pm}^{\alpha,\beta}$. Solid grey lines indicate Gaussian approximations of the empirical distributions.
Figure 3: Estimating the trace of the Hessian $H_\theta$. In panels (a) and (b), the loss functions are given by Eqs. \ref{eq:loss_symmetric} and \ref{eq:loss_asymmetric}, respectively. We evaluate the corresponding Hessians (Eqs. \ref{eq:emp_hessian_1} and \ref{eq:emp_hessian_2}) at the saddle point $\theta^*=(\theta^*_1,\dots,\theta^*_{2n},\theta^*_{2n+1})=(0,\dots,0,1)$. In both loss functions, we set $n=500$, and in the loss function of Eq. \ref{eq:loss_asymmetric} we set $\tilde{n}=800$. Solid black and dash-dotted red lines represent Hutchinson [$\langle z^\top H_\theta z \rangle$; see Eq. \ref{eq:hutchinson}] and curvature-based ($\langle \kappa^\alpha\rangle$) estimates of $\mathrm{tr}(H_\theta)$, respectively. We compute ensemble means $\langle \cdot \rangle$ as defined in Eq. \ref{eq:ensemble_average} for different numbers of random projections $S$. The trace estimates in panels (a) and (b), respectively, converge towards the true trace values $\mathrm{tr}(H_\theta)=0$ and $\mathrm{tr}(H_\theta)=600$ that are indicated by dashed grey lines. In both methods, the same random vectors with elements that are distributed according to a standard normal distribution are used. For the curvature-based estimation of $\mathrm{tr}(H_\theta)$, we perform least-squares fits of $L(\theta^*+\alpha\eta)$ over an interval $\alpha\in[-0.05,0.05]$.
Figure 4: Dimensionality-reduced loss $L(\theta^*+\alpha \eta+\beta\delta)$ of Eq. \ref{eq:loss_asymmetric} with $n=900,\tilde{n}=1000$ for different directions $\eta,\delta$. (a--c) The directions $\eta,\delta$ correspond to eigenvectors of the Hessian $H_\theta$ of Eq. \ref{eq:loss_asymmetric}. If the eigenvalues associated with $\eta,\delta$ have different signs, the corresponding loss landscape is a saddle as depicted in panel (a). If the eigenvalues associated with $\eta,\delta$ have the same sign, the corresponding loss landscape is either a minimum (both signs are positive) as shown in panel (b) or a maximum (both signs are negative) as shown in panel (c). Because there is an excess of $\tilde{n}-n=100$ positive eigenvalues in $H_\theta$, a projection onto a dimension-reduced space that is spanned by the random directions $\eta,\delta$ is often associated with an apparently convex loss landscape. An example of such an apparent minimum is shown in panel (d). We selected a single pair of random directions $\eta,\delta$ (i.e., no averaging over random directions has been performed).
Figure 5: Loss landscape projections for ResNet-56. (a,c) The projection directions $\eta,\delta$ are given by the eigenvectors associated with the largest and smallest eigenvalues of the Hessian $H_\theta$, respectively. The zoomed inset in panel (c) shows the loss landscape for $(\alpha,\beta)\in[-0.01,0.005]\times[-0.05,0.05]$. We observe a decreasing loss along the negative $\beta$-axis. (b,d) The projection directions $\eta,\delta$ are given by random vectors. We selected a single pair of random directions $\eta,\delta$ (i.e., no averaging over random directions has been performed). The domains of $(\alpha,\beta)$ in panels (a,b) and (c,d) are $[-0.05,0.05]\times[-0.05,0.05]$ and $[-1,1]\times[-1,1]$, respectively. All shown cross-entropy loss landscapes are based on evaluating the CIFAR-10 training dataset that consists of 50,000 images.
...and 4 more figures
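The two trace estimators compared in the Figure 3 caption can be sketched as follows. This is an illustrative example under stated assumptions, not the paper's implementation: the quadratic toy loss, its diagonal Hessian spectrum (chosen here so that the trace is 600), the number of samples, and the 11-point grid are all hypothetical. Only the fit interval $\alpha\in[-0.05,0.05]$ and the use of standard-normal vectors come from the caption. The curvature-based estimate uses loss evaluations alone, whereas the Hutchinson-type estimate needs the product $H_\theta z$.

```python
import numpy as np


def loss(theta, eigvals):
    """Toy quadratic loss 0.5 * theta^T H theta with diagonal H.
    The critical point theta* is the origin. Accepts a batch (S, N)."""
    return 0.5 * (theta * theta) @ eigvals


rng = np.random.default_rng(1)

# Hypothetical spectrum: 800 eigenvalues +1 and 200 eigenvalues -1.
eigvals = np.concatenate([np.ones(800), -np.ones(200)])
true_trace = eigvals.sum()  # 600

num_samples = 1000
z = rng.standard_normal((num_samples, eigvals.size))

# Hutchinson-type samples z^T H z: require the Hessian-vector product H z.
hutchinson_samples = np.einsum("si,si->s", z, z * eigvals)

# Curvature-based samples: fit a parabola to L(theta* + alpha * z) along
# each random direction and read off the second derivative.
alphas = np.linspace(-0.05, 0.05, 11)
line_losses = np.stack([loss(a * z, eigvals) for a in alphas])  # (11, S)
coeffs = np.polyfit(alphas, line_losses, deg=2)                 # (3, S)
curvature_samples = 2.0 * coeffs[0]  # d^2 L / d alpha^2 per direction

print("true trace:          ", true_trace)
print("Hutchinson estimate: ", hutchinson_samples.mean())
print("curvature estimate:  ", curvature_samples.mean())
```

Because the toy loss is exactly quadratic, the fitted curvature coincides with $z^\top H_\theta z$ for each sample up to floating-point error. For a real network loss, the fit would also pick up higher-order terms, so the choice of interval matters.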
