A topological description of loss surfaces based on Betti Numbers
Maria Sofia Bucarelli, Giuseppe Alessio D'Inverno, Monica Bianchini, Franco Scarselli, Fabrizio Silvestri
TL;DR
The paper introduces a topological lens on neural network loss landscapes by examining the Betti numbers of sublevel sets $S_{\mathcal{N}} = \{ \theta \; | \; \mathcal{L}_{\mathcal{N}}(\theta) \leq c \}$, for a threshold $c$, under Pfaffian activation functions. It proves that both the MSE and BCE losses are Pfaffian functions, with explicit formats that depend on network depth $L$, width $h$, and the last-layer nonlinearity. This yields Betti-number bounds $B(S_{\mathcal{N}})$ that scale super-exponentially with depth and width and exponentially with the number of samples $m$. The results further show that, in the specific cases analyzed, adding $\ell_2$ regularization or skip connections does not change these topological bounds, which is consistent with the observed stability of loss topology under such changes. By deriving corollaries for common activations (sigmoid, tanh) and comparing deep and shallow regimes, the work links architectural choices and the data regime to the complexity of the optimization landscape, with implications for understanding training difficulty and guiding design choices. The framework opens the way for future work connecting Pfaffian-topology bounds with Morse theory and with tighter, possibly component-wise, characterizations of loss landscape connectivity.
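To make the notion of a sublevel set and its zeroth Betti number concrete, here is a minimal sketch that is not taken from the paper. It assumes a toy one-dimensional "loss" (a double well standing in for $\mathcal{L}_{\mathcal{N}}$) and counts the connected components of $\{ \theta \; | \; f(\theta) \leq c \}$ on a uniform grid. The names `count_components_1d` and `double_well` are hypothetical, and the grid count is only an approximation of $b_0$. The paper derives bounds on Betti numbers analytically rather than computing them this way.

```python
def count_components_1d(f, lo, hi, c, n=2001):
    """Approximate b_0 of {x in [lo, hi] : f(x) <= c} on a uniform grid.

    Each maximal run of consecutive grid points with f(x) <= c is
    counted as one connected component.
    """
    step = (hi - lo) / (n - 1)
    components = 0
    inside_prev = False
    for i in range(n):
        x = lo + i * step
        inside = f(x) <= c
        if inside and not inside_prev:
            # A new run of points below the threshold starts here.
            components += 1
        inside_prev = inside
    return components


def double_well(theta):
    """Toy loss (assumption): two minima at theta = -1 and theta = +1."""
    return (theta * theta - 1.0) ** 2


# Below the barrier height f(0) = 1 the sublevel set splits in two;
# above it the two basins merge into a single interval.
b0_low = count_components_1d(double_well, -2.0, 2.0, c=0.5)
b0_high = count_components_1d(double_well, -2.0, 2.0, c=1.5)
print(b0_low, b0_high)
```

In this toy case the component count drops as $c$ crosses the barrier between the two minima, which is the kind of change in sublevel-set topology that Betti numbers are meant to measure.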
Abstract
In the context of deep learning models, attention has recently been paid to studying the surface of the loss function in order to better understand training with methods based on gradient descent. This search for an appropriate description, both analytical and topological, has led to numerous efforts to identify spurious minima and characterize gradient dynamics. Our work aims to contribute to this field by providing a topological measure to evaluate loss complexity in the case of multilayer neural networks. We compare deep and shallow architectures with common sigmoidal activation functions by deriving upper and lower bounds on the complexity of their loss function and revealing how that complexity is influenced by the number of hidden units, training models, and the activation function used. Additionally, we found that certain variations in the loss function or model architecture, such as adding an $\ell_2$ regularization term or implementing skip connections in a feedforward network, do not affect loss topology in specific cases.
