On Dissipativity of Cross-Entropy Loss in Training ResNets

Jens Püttschneider, Timm Faulwasser

TL;DR

The paper casts ResNet and neural ODE training as finite-horizon optimal control problems and develops a dissipativity-based analysis using a soft-cross-entropy regularization. It proves strict dissipativity with respect to a subspace of soft-cross-entropy minimizers, establishing a turnpike property that concentrates optimal trajectories near these minimizers for most of the horizon. Neural ODE extensions and equilibria-preserving discretizations are discussed, and the approach is validated on the two spirals and MNIST datasets, showing that training concentrates near per-class minimizer subspaces and enabling depth cropping. Overall, the framework provides a principled method to determine minimal necessary depth and to understand training dynamics through infinite-horizon-inspired concepts applied to finite-depth networks.

Abstract

The training of ResNets and neural ODEs can be formulated and analyzed from the perspective of optimal control. This paper proposes a dissipative formulation of the training of ResNets and neural ODEs for classification problems by including a variant of the cross-entropy as a regularization in the stage cost. Based on the dissipative formulation of the training, we prove that the trained ResNets exhibit the turnpike phenomenon. We then illustrate that the training exhibits the turnpike phenomenon by training on the two spirals and MNIST datasets. This can be used to find very shallow networks suitable for a given classification task.
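
The optimal-control reading above can be illustrated with a minimal sketch. It is not the authors' implementation: the `tanh` residual map, the step size, the assumption that the state dimension equals the number of classes (so each layer's state can be read directly as logits), and the helper names `resnet_trajectory`, `layerwise_loss`, and `crop_depth` are all hypothetical choices made for illustration. The sketch treats a ResNet as a discrete-time system, evaluates the ordinary cross-entropy at every layer, and picks the shallowest layer after which the loss stays within a tolerance of its final value, which is one plausible way to exploit a turnpike for depth cropping.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy; logits has shape (N, C), labels shape (N,)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def resnet_trajectory(x0, weights, biases, step=1.0):
    """Propagate x_{k+1} = x_k + step * tanh(x_k W_k + b_k) (assumed residual map).

    The weights play the role of controls and the layer index the role of time.
    Returns the list of states [x_0, x_1, ..., x_K].
    """
    states = [x0]
    for W, b in zip(weights, biases):
        states.append(states[-1] + step * np.tanh(states[-1] @ W + b))
    return states


def layerwise_loss(states, labels):
    """Cross-entropy of each layer's state, read directly as logits (assumption)."""
    return [softmax_cross_entropy(x, labels) for x in states]


def crop_depth(losses, tol):
    """Smallest layer index k such that every loss from k onward is within tol
    of the final loss (a hypothetical cropping rule, not taken from the paper)."""
    final = losses[-1]
    k = len(losses) - 1
    while k > 0 and abs(losses[k - 1] - final) <= tol:
        k -= 1
    return k
```

Given trained weights, `crop_depth(layerwise_loss(resnet_trajectory(x0, Ws, bs), y), tol)` would return a candidate depth at which the network could be truncated. The tolerance `tol` is a free parameter of this sketch.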

Paper Structure

This paper contains 16 sections, 11 theorems, 66 equations, 5 figures, 1 table.

Key Result

Lemma 2

The stage cost (paper equation label: eq:stagecost) has no minimizers in $\mathbb{R}^{C\cdot D}$.
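
A short worked argument suggests why no minimizer can exist. It assumes the stage cost reduces, per data point, to the standard softmax cross-entropy; the exact form of the paper's stage cost is not reproduced here. The symbols $\ell_y$ and $e_y$ (the $y$-th unit vector) are introduced only for this illustration.

```latex
% Assumption: standard softmax cross-entropy for one sample with
% logits x in R^C and target class y.
\ell_y(x) = -\log \frac{e^{x_y}}{\sum_{c=1}^{C} e^{x_c}}
          = \log\Bigl(1 + \sum_{c \neq y} e^{x_c - x_y}\Bigr) > 0
          \quad \text{for all } x \in \mathbb{R}^{C},
\qquad
\lim_{t \to \infty} \ell_y(x + t\, e_y) = 0 .
% Hence \inf_x \ell_y(x) = 0 is approached only as x_y - x_c \to \infty
% for every c \neq y, and it is never attained at a finite x.
```

Summing this over $D$ data points gives a cost on $\mathbb{R}^{C\cdot D}$ whose infimum is likewise zero and not attained, which is consistent with the statement of the lemma.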

Figures (5)

  • Figure 1: Illustration of the soft cross-entropy and its minimizer set for two classes with the target class $y=1$.
  • Figure 2: Two Spirals dataset.
  • Figure 3: Evolution of the state trajectories for the two classes of the two spirals dataset.
  • Figure 4: State of the data trajectories in the last layer and the sets of soft cross-entropy minimizers for the two classes, $\mathbb{X}^\star_{1}$ and $\mathbb{X}^\star_{2}$.
  • Figure 5: The loss over the layers of the ResNet for the MNIST dataset in linear and logarithmic scale. The solid line represents the training loss and the dashed line represents the test loss.

Theorems & Definitions (26)

  • Definition 1: Strict dissipativity in discrete time
  • Lemma 2: No minimizers for cross-entropy
  • proof
  • Lemma 3: Minimizers of soft cross-entropy
  • proof
  • Remark 4: Large data with $\dim x > C$
  • Lemma 5: Invariance of soft cross-entropy
  • proof
  • Lemma 6: $T$ preserves the distance to $\mathbb{X}^\star_{y}$
  • proof
  • ...and 16 more