A Gaussian Comparison Theorem for Training Dynamics in Machine Learning

Ashkan Panahi

TL;DR

A non-asymptotic result is presented that connects the evolution of the model to a surrogate dynamical system, which can be easier to analyze. An iterative refinement scheme is also suggested to obtain more accurate expressions in non-asymptotic scenarios.

Abstract

We study training algorithms with data following a Gaussian mixture model. For a specific family of such algorithms, we present a non-asymptotic result connecting the evolution of the model to a surrogate dynamical system, which can be easier to analyze. The proof of our result is based on the celebrated Gordon comparison theorem. Using our theorem, we rigorously prove the validity of the dynamic mean-field (DMF) expressions in asymptotic scenarios. Moreover, we suggest an iterative refinement scheme to obtain more accurate expressions in non-asymptotic scenarios. We specialize our theory to the analysis of training a perceptron model with a generic first-order (full-batch) algorithm and demonstrate that, in the non-asymptotic regime, fluctuation parameters emerge in addition to the DMF kernels.
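The setting described in the abstract (Gaussian mixture data, a perceptron, a full-batch first-order algorithm) can be illustrated with a small simulation of the original training dynamics. This is a minimal sketch under stated assumptions, not the paper's algorithm family or its surrogate system: the logistic loss, the $\pm 1$ label mapping, the class means, the dimensions, and all function names (`sample_mixture`, `logistic_loss_and_grad`, `train`) are hypothetical choices made for illustration. The `step` and `momentum` arguments loosely mirror a step size and a forgetting factor, but their exact role in the paper's update rule is an assumption.

```python
import math
import random


def sample_mixture(n, means, rng):
    """Draw n points: pick a class uniformly, then add standard Gaussian noise to its mean."""
    xs, ys = [], []
    for _ in range(n):
        y = rng.randrange(len(means))
        xs.append([m + rng.gauss(0.0, 1.0) for m in means[y]])
        ys.append(y)
    return xs, ys


def logistic_loss_and_grad(theta, xs, ys):
    """Full-batch mean logistic loss and its gradient (labels mapped to -1/+1)."""
    n = len(xs)
    loss, grad = 0.0, [0.0] * len(theta)
    for x, y in zip(xs, ys):
        label = 2 * y - 1
        margin = label * sum(w * xi for w, xi in zip(theta, x))
        # Numerically stable log(1 + exp(-margin)).
        loss += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
        # Numerically stable sigmoid(-margin).
        if margin >= 0:
            e = math.exp(-margin)
            sig = e / (1.0 + e)
        else:
            sig = 1.0 / (1.0 + math.exp(margin))
        for j, xj in enumerate(x):
            grad[j] -= label * sig * xj
    return loss / n, [g / n for g in grad]


def train(xs, ys, theta0, step=0.2, momentum=0.0, iters=50):
    """Full-batch gradient descent with an (assumed) heavy-ball style momentum term."""
    theta = list(theta0)
    velocity = [0.0] * len(theta)
    losses = []
    for _ in range(iters):
        loss, grad = logistic_loss_and_grad(theta, xs, ys)
        losses.append(loss)
        velocity = [momentum * v + g for v, g in zip(velocity, grad)]
        theta = [w - step * v for w, v in zip(theta, velocity)]
    return theta, losses


rng = random.Random(0)
dim = 5
means = [[-1.0] * dim, [1.0] * dim]      # illustrative class means (assumption)
xs, ys = sample_mixture(200, means, rng)  # two equiprobable classes
theta0 = [0.1 / math.sqrt(dim)] * dim     # initial weights with Euclidean norm 0.1
theta, losses = train(xs, ys, theta0, step=0.2, momentum=0.0, iters=50)
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

Rerunning `train` with a nonzero `momentum` and comparing the two `losses` lists gives a rough empirical picture of how a forgetting factor changes the training curve in this toy setting.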

Paper Structure (27 sections, 10 theorems, 72 equations, 3 figures, 1 algorithm)

Introduction
Problem formulation
Data Model
Training Algorithms
Main Result
Alternative Process
Main Theorem
Claim: Elimination of $\sigma,z$
Approximation Scheme in Large Problems
Initialization: Dynamic Mean Field Approximation
Proof of the Main Theorem
Gordon's Comparison Lemma
Gordon Lemma for zeros of Gaussian Processes
Proof of the Main Theorem
Example: Classification with Perceptron
...and 12 more sections

Key Result

Theorem 1

For every $\sigma>0$ and $z\in\mathbb{R}$, the solutions $\xi_\psi$ and $\xi'_\phi$ have identical distributions. In other words, for any measurable function $h:\mathcal{B}\to\mathbb{R}$, we have
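The sentence above appeals to the standard characterization of equality in distribution. A sketch of that characterization in the symbols of Theorem 1 follows; the integrability condition on $h$ is an assumption added here, and the paper's own display may be stated differently:

```latex
\mathbb{E}\big[h(\xi_\psi)\big] = \mathbb{E}\big[h(\xi'_\phi)\big]
\quad \text{for every measurable } h:\mathcal{B}\to\mathbb{R}
\text{ for which both expectations exist.}
```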

Figures (3)

Figure 1: Training error for gradient descent ($s=0, t=.2$) for two classes with $\rho(y)=.5,\ \|\theta_0\|=.1$, overlaps $v(y,y)=1$, and varying coupling $v(0,1)$.
Figure 2: Training error for momentum gradient descent ($t=.2$) with varying forgetting factor $s$ and two classes with $\rho(y)=.5,\ \gamma=1,\ \|\theta_0\|=.1$, and overlaps: $v(y,y)=1$ and coupling $v(0,1)=-.5$.
Figure 3: Normalized variance of training a perceptron with soft ReLU function, and two classes with $\rho(y)=.5,\ \gamma=1,\ \|\theta_0\|=.1$, and overlaps: $v(y,y)=1$ and coupling $v(0,1)=-.5$. The empirical values are calculated by averaging over $10^5$ realizations.

Theorems & Definitions (28)

Theorem 1
Claim 1
Theorem 2
proof
Theorem 3: Extended Gordon Lemma
proof
Proposition 1
Definition 1
Theorem 4
proof
...and 18 more
