Dynamical loss functions shape landscape topography and improve learning in artificial neural networks

Eduardo Lavin Pallero, Miguel Ruiz-Garcia

TL;DR

The paper addresses how oscillatory, class-wise weights in loss functions can reshape the loss landscape to improve learning in neural networks. It introduces dynamical cross-entropy $\mathcal{F}_{\mathrm{DCE}}$ with per-class oscillations $\Gamma_i(t)$ and two dynamical MSE variants $\mathcal{F}_{\mathrm{DMSE1}}, \mathcal{F}_{\mathrm{DMSE2}}$, governed by amplitude $A$ and period $T$, preserving the global minima while altering optimization trajectories. Empirically, these dynamical losses yield higher validation accuracy than static losses, especially for small networks, and interact with curvature to produce period-doubling instabilities that enhance exploration in the loss landscape. The work links dynamical loss design to edge-of-stability minimization and suggests avenues for curriculum-free learning strategies and future integration with SGD-based training, with practical implications for reducing overparameterization and computational cost.

Abstract

Dynamical loss functions are derived from standard loss functions used in supervised classification tasks, but are modified so that the contribution from each class periodically increases and decreases. These oscillations globally alter the loss landscape without affecting the global minima. In this paper, we demonstrate how to transform cross-entropy and mean squared error into dynamical loss functions. We begin by discussing the impact of increasing the size of the neural network or the learning rate on the depth and sharpness of the minima that the system explores. Building on this intuition, we propose several versions of dynamical loss functions and use a simple classification problem where we can show how they significantly improve validation accuracy for networks of varying sizes. Finally, we explore how the landscape of these dynamical loss functions evolves during training, highlighting the emergence of instabilities that may be linked to edge-of-instability minimization.
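
As a rough illustration of class-wise oscillating weights, the following NumPy sketch weights each sample's cross-entropy term by a time-dependent factor $\Gamma_i(t)$ for its class. The specific waveform is an assumption and is not taken from the paper: each class follows a triangular wave between $1$ and $A$ with period $T$, shifted in phase from the other classes, and the weights are normalized to sum to the number of classes. The function names `class_weights` and `dynamical_cross_entropy` are hypothetical. With $A=1$ the weights are all equal to one and the loss reduces to standard cross-entropy, in line with the captions' description of $A=1$ as the static loss.

```python
import numpy as np

def class_weights(t, n_classes, A, T):
    """Hypothetical per-class weights Gamma_i(t).

    Assumption (not from the paper): a triangular wave between 1 and A
    with period T, phase-shifted by i / n_classes for class i, then
    normalized so the weights sum to n_classes.
    """
    phase = (t / T + np.arange(n_classes) / n_classes) % 1.0
    tri = 1.0 - np.abs(2.0 * phase - 1.0)   # triangular wave in [0, 1]
    gamma = 1.0 + (A - 1.0) * tri           # oscillates between 1 and A
    return n_classes * gamma / gamma.sum()

def dynamical_cross_entropy(logits, labels, t, A, T):
    """Cross-entropy where each sample is weighted by its class weight."""
    n_classes = logits.shape[1]
    gamma = class_weights(t, n_classes, A, T)
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    per_sample = -log_probs[np.arange(len(labels)), labels]
    return float(np.mean(gamma[labels] * per_sample))
```

In this sketch the weights stay strictly positive for any $A \geq 1$, so a parameter setting that minimizes every per-sample term still minimizes the weighted sum; only the relative emphasis on each class changes over time.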

Paper Structure

This paper contains 7 sections, 4 equations, 5 figures.

Figures (5)

  • Figure 1: Interplay between network size and learning rate in the minimization of the cross-entropy loss function. For three different network sizes and two different learning rates ($\eta$) we average, over 20 simulations, the instantaneous values of the loss function and curvature (largest eigenvalue of the Hessian, $\lambda_{max}$) during minimization. The training dataset is presented in the inset of panel (a): 2D points belonging to three classes that follow the three arms of a spiral. Panel (a) shows the mean value of the loss, whereas (b) shows the mean value of the largest eigenvalue of the Hessian. Spikes during training (edge-of-stability minimization) are clearer in panel (a). Although panel (a) suggests that the value of the loss during training is controlled by NN size, panel (b) shows how smaller learning rates tend to explore narrower regions of the landscape. Even so, we see that for large network sizes the system tends to reach wider valleys for both learning rates, escaping edge-of-stability minimization. Panel (c) shows an idealized illustration of the loss function landscape that can qualitatively explain the behavior of panels (a) and (b). We represent a subspace of parameter space that displays at its center a global minimum occupying a finite region of parameter space. Valleys that lead to the global minimum need to increase their sharpness to fit into the boundary. Minimization in this landscape (green dots) can lead to instabilities due to the increase in curvature as the system approaches the minimum. During the instabilities the system can jump towards wider valleys and in some cases it can find one that is wide enough to smoothly descend into the global minimum (this would occur if the NN is large enough). Panel (c) is only a toy idealization to help gain intuition about learning in NNs.
  • Figure 2:
  • Figure 3: Visualizing the evolution of the system during one period-doubling cascade. Panels (a)-(e) display the evolution of the system as it approaches the last part of the second period in Fig. 2 ($t=10000$), where instabilities emerge. We represent a subspace of the parameter space spanned by the eigenvectors associated with the two largest eigenvalues of the Hessian, $\vec{v}_1$ and $\vec{v}_2$, where $a$ and $b$ take values in $[-1,1]$. The background color displays the value of $\mathcal{F}$ around the point in parameter space where the system is at $t=8000$, $9285$, $9450$, $9560$ and $9600$ ($\vec{x}_t$, a red cross in the plots). We fix the background and plot green dots representing the position of the system at times $\{t, t+1, \dots, t+20 \}$, connecting them with semitransparent green arrows. We use a neural network with one hidden layer of width $100$ and full batch gradient descent, where $T=5000$ and $A=70$, chosen for ease of visualization.
  • Figure 4: Dynamical loss functions can train smaller networks. We again use a fully connected NN and the Swiss Roll dataset as in Fig. 1 (a). The validation dataset in both cases is similar to the training dataset but generated with a different seed, leading to a different distribution of the points along the spiral. In panel (a) we compare CE and DCE with $\eta = 1$, and we average 50 simulations for each NN size. In panel (b) we compare MSE with the two versions of DMSE with $\eta = 0.075$, and average 10 simulations for each NN size. In both panels we change to $A=1$ (static loss) for the last oscillation, where we measure the accuracy, showing that dynamical loss functions take the system to a different region of parameter space (compared to their static versions) where validation accuracy improves. The error bars display the error of the mean.
  • Figure 5: We represent the value of $\lambda_{\mathrm{max}}$ at which instabilities emerge (see Fig. 2 (d)) vs $\eta$. To show the robustness of the result, we simulate different conditions: $A = [25, 50, 75]$, $T = [5{,}000, 10{,}000]$ for a total time $T_{\mathrm{total}} = 100{,}000/\eta$, with $NN_{\mathrm{width}} = 100$. We represent the mean value of 30 simulations per condition. The dashed line is the prediction from edge-of-stability minimization. The error bars display the error of the mean.
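
The Figure 5 caption refers to a prediction from edge-of-stability minimization without giving its formula. The sketch below assumes the standard stability condition for gradient descent on a quadratic, $\lambda_{\mathrm{max}} = 2/\eta$; this is an assumption about what the dashed line shows, not something stated in the captions. On the one-dimensional loss $f(x) = \lambda x^2 / 2$ each step multiplies $x$ by $(1 - \eta\lambda)$, so the iterates shrink when $\lambda < 2/\eta$ and grow when $\lambda > 2/\eta$. The helper names `stability_threshold` and `gd_on_quadratic` are hypothetical.

```python
def stability_threshold(eta):
    """Curvature above which plain gradient descent diverges on a quadratic.

    Assumption: the standard 2 / eta condition for full-batch gradient descent.
    """
    return 2.0 / eta

def gd_on_quadratic(curvature, eta, steps=50, x0=1.0):
    """Run gradient descent on f(x) = curvature * x**2 / 2 and return |x|."""
    x = x0
    for _ in range(steps):
        x -= eta * curvature * x   # gradient of f is curvature * x
    return abs(x)

eta = 0.1
lam_c = stability_threshold(eta)
below = gd_on_quadratic(0.95 * lam_c, eta)  # contracts towards the minimum
above = gd_on_quadratic(1.05 * lam_c, eta)  # oscillates with growing amplitude
```

Just above the threshold the iterate flips sign at every step while its magnitude grows, which is the simplest picture of an instability driven by curvature exceeding what the learning rate can tolerate.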