Identifying Equivalent Training Dynamics

William T. Redman; Juan M. Bello-Rivas; Maria Fonoberova; Ryan Mohr; Ioannis G. Kevrekidis; Igor Mezić

TL;DR

A Koopman operator-based framework is developed for identifying conjugate and non-conjugate training dynamics in DNN models. Comparing Koopman eigenvalues correctly identifies a known equivalence between online mirror descent and online gradient descent, and the same comparison uncovers non-conjugate training dynamics in Transformers that do and do not undergo grokking.

Abstract

Study of the nonlinear evolution deep neural network (DNN) parameters undergo during training has uncovered regimes of distinct dynamical behavior. While a detailed understanding of these phenomena has the potential to advance improvements in training efficiency and robustness, the lack of methods for identifying when DNN models have equivalent dynamics limits the insight that can be gained from prior work. Topological conjugacy, a notion from dynamical systems theory, provides a precise definition of dynamical equivalence, offering a possible route to address this need. However, topological conjugacies have historically been challenging to compute. By leveraging advances in Koopman operator theory, we develop a framework for identifying conjugate and non-conjugate training dynamics. To validate our approach, we demonstrate that comparing Koopman eigenvalues can correctly identify a known equivalence between online mirror descent and online gradient descent. We then utilize our approach to: (a) identify non-conjugate training dynamics between shallow and wide fully connected neural networks; (b) characterize the early phase of training dynamics in convolutional neural networks; (c) uncover non-conjugate training dynamics in Transformers that do and do not undergo grokking. Our results, across a range of DNN architectures, illustrate the flexibility of our framework and highlight its potential for shedding new light on training dynamics.
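The idea of comparing Koopman eigenvalues can be illustrated on a toy problem. The sketch below is not the authors' pipeline: it assumes a plain dynamic mode decomposition (a least-squares one-step linear fit) as the finite-dimensional Koopman approximation, and it uses linear maps so that the conjugacy is known by construction. The names `dmd_eigenvalues` and `simulate` and the matrices `A`, `P`, `B`, and `C` are hypothetical. System `B` is `A` seen through the invertible change of coordinates `P`, so the two should share eigenvalues, while `C` is an unrelated system.

```python
import numpy as np


def dmd_eigenvalues(snapshots):
    """Eigenvalues of the least-squares linear map that advances snapshots one step.

    snapshots: array of shape (n_state, n_steps), columns ordered in time.
    """
    past, future = snapshots[:, :-1], snapshots[:, 1:]
    # Best-fit matrix K with future ~= K @ past (assumed DMD approximation).
    koopman_approx = future @ np.linalg.pinv(past)
    return np.linalg.eigvals(koopman_approx)


def simulate(step_matrix, initial_state, n_steps):
    """Iterate x_{t+1} = step_matrix @ x_t and return the trajectory as columns."""
    states = [initial_state]
    for _ in range(n_steps - 1):
        states.append(step_matrix @ states[-1])
    return np.stack(states, axis=1)


# Hypothetical toy systems (not from the paper).
A = np.array([[0.9, 0.2], [0.0, 0.5]])   # reference dynamics
P = np.array([[2.0, 1.0], [1.0, 1.0]])   # invertible change of coordinates
B = P @ A @ np.linalg.inv(P)             # conjugate to A by construction
C = np.array([[0.7, 0.0], [0.0, 0.3]])   # different dynamics

x0 = np.array([1.0, -0.5])
eig_a = np.sort(dmd_eigenvalues(simulate(A, x0, 20)).real)
eig_b = np.sort(dmd_eigenvalues(simulate(B, P @ x0, 20)).real)
eig_c = np.sort(dmd_eigenvalues(simulate(C, x0, 20)).real)

print("A:", eig_a)
print("B:", eig_b)
print("C:", eig_c)
```

In this toy setting the trajectories of `A` and `B` look different coordinate by coordinate, yet their estimated eigenvalues agree, while those of `C` do not. For nonlinear training dynamics a richer set of observables would be needed, and that choice is outside what this sketch covers.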

Paper Structure (32 sections, 8 equations, 11 figures, 2 tables, 3 algorithms)

Introduction
Related work
Identification of DNN training dynamics phenomena
Koopman operator theory applied to DNN training
Identifying equivalent training dynamics
Topological conjugacy
Koopman mode decomposition
Equivalent Koopman spectra implies topological conjugacy
Results
Identifying conjugate optimizers
Identifying the effect of width on fully connected neural network training
Identifying dynamical transitions in convolutional neural network training
Identifying non-conjugate training dynamics for Transformers that do and do not grok
Discussion
Online mirror and online gradient descent
...and 17 more sections

Figures (11)

Figure 1: Schematic of Koopman operator theory-based identification of conjugate dynamical systems. (A) By lifting nonlinear dynamics from a finite dimensional state-space to an infinite dimensional function space, a linear representation can be achieved (from which a finite dimensional approximation can be obtained). (B) The linearity of the Koopman operator enables a mode decomposition, which includes Koopman eigenvalues (orange), eigenfunctions (green), and modes (blue). (C) Dynamical systems with the same Koopman eigenvalues are topologically conjugate.
Figure 2: Conjugacy between online mirror descent and online gradient descent is identifiable from Koopman spectra. (A) Comparing example trajectories of variables optimized via OMD ($x_1, x_2$), OGD ($u_1, u_2$), and BM ($z_1, z_2$), the existence of a conjugacy between OMD and OGD is not obvious. (B) Similarly, the existence of a conjugacy is not apparent when looking at the loss incurred by using OMD and OGD. (C) Comparing the Koopman eigenvalues associated with optimizing using OMD, OGD, and BM correctly identifies the existence of a conjugacy between OMD and OGD, and the lack of a conjugacy between OMD/OGD and BM. The function optimized in all subfigures is $f(x) = \sum \tan(x)$.
Figure 3: Narrow and wide fully connected neural networks have non-conjugate training dynamics. (A) Training loss curves for FCNs with hidden layer widths $h = 5, 10,$ and $40$. Solid line is mean and shaded area is $\pm$ standard deviation across $25$ independently trained networks. (B), (C) Example weight trajectories, across training iterations, for narrow, intermediate, and wide FCNs. (D) Koopman eigenvalues associated with training FCNs of varying width. (E) Same as (D), but zoomed out and with the eigenvalues associated with $h = 5$ and $h = 10$ compared to those associated with $h = 40$. Dashed line in (D) and (E) denotes unit circle. (F) Wasserstein distance between Koopman eigenvalues associated with training FCNs of varying width. Error bars are $\pm$ standard deviation across $25$ independently trained FCNs. Kolmogorov–Smirnov (KS) tests were performed to assess statistical significance of distance: $*$ denotes $p < 0.01$ and $***$ denotes $p < 0.0001$.
Figure 4: Koopman-based framework enables identification of transitions in dynamics during the early phase of training for LeNet and ResNet-20. (A) Log$_{10}$ Wasserstein distance between Koopman eigenvalues associated with LeNet training over windows of 100 training iterations during epoch 1. (B) Same as (A), but for ResNet-20 training. (C) Koopman eigenvalues associated with the dynamics that occur during training iteration intervals 0--99, 400--499, and 600--699. Dashed line denotes the unit circle.
Figure 5: Transformers that do and that do not undergo grokking have early training dynamics that are not conjugate. (A) Train and test loss, as a function of training steps, for a Transformer model that undergoes grokking. (B) Same as (A), but for a Transformer whose training is constrained to have a constant weight norm (Liu et al., 2022). In this case, no grokking is observed. (C) In the first 100 training steps, little difference is seen between the test loss of Transformers with and without constrained training. Lines are mean and shaded area is $\pm$ standard deviation across 20 independently trained networks. (D) Koopman eigenvalues associated with the dynamics that occur over the first 100 training iterations for Transformers that do and that do not undergo grokking.
...and 6 more figures
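Figures 3 and 4 report a Wasserstein distance between sets of Koopman eigenvalues. A minimal sketch of one way such a distance could be computed is given below. It assumes both spectra contain the same number of eigenvalues with uniform weights, in which case the 1-Wasserstein distance reduces to the cheapest one-to-one matching of points in the complex plane. The function name `spectral_wasserstein` is hypothetical, the exact variant used in the paper is not stated in this section, and the brute-force search over matchings is only practical for a handful of eigenvalues.

```python
import itertools

import numpy as np


def spectral_wasserstein(eigs_a, eigs_b):
    """1-Wasserstein distance between two equal-sized sets of complex eigenvalues.

    Each eigenvalue carries weight 1/n, so the optimal transport plan is a
    permutation. All permutations are tried, which is feasible only for small n.
    """
    eigs_a = np.asarray(eigs_a, dtype=complex)
    eigs_b = np.asarray(eigs_b, dtype=complex)
    if eigs_a.ndim != 1 or eigs_a.shape != eigs_b.shape:
        raise ValueError("expected two 1-D arrays of equal length")
    n = len(eigs_a)
    # cost[i, j] is the distance in the complex plane between a_i and b_j.
    cost = np.abs(eigs_a[:, None] - eigs_b[None, :])
    rows = np.arange(n)
    best_total = min(
        cost[rows, list(perm)].sum() for perm in itertools.permutations(range(n))
    )
    return best_total / n


# Hypothetical spectra (illustrative values, not taken from the paper).
spectrum_1 = [0.9, 0.5 + 0.1j, 0.5 - 0.1j]
spectrum_1_shuffled = [0.5 - 0.1j, 0.9, 0.5 + 0.1j]
spectrum_2 = [0.8, 0.5 + 0.1j, 0.5 - 0.1j]

same = spectral_wasserstein(spectrum_1, spectrum_1_shuffled)
different = spectral_wasserstein(spectrum_1, spectrum_2)
print(same, different)
```

The distance is insensitive to the order in which eigenvalues are listed, which matters because eigenvalue solvers return them in no guaranteed order. For larger spectra an assignment solver or an optimal transport library would be the more practical choice.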
