
A Gaussian Comparison Theorem for Training Dynamics in Machine Learning

Ashkan Panahi

TL;DR

A non-asymptotic result is presented that connects the evolution of the model to a surrogate dynamical system, which can be easier to analyze. An iterative refinement scheme is also suggested to obtain more accurate expressions in non-asymptotic scenarios.

Abstract

We study training algorithms with data following a Gaussian mixture model. For a specific family of such algorithms, we present a non-asymptotic result, connecting the evolution of the model to a surrogate dynamical system, which can be easier to analyze. The proof of our result is based on the celebrated Gordon comparison theorem. Using our theorem, we rigorously prove the validity of the dynamic mean-field (DMF) expressions in the asymptotic scenarios. Moreover, we suggest an iterative refinement scheme to obtain more accurate expressions in non-asymptotic scenarios. We specialize our theory to the analysis of training a perceptron model with a generic first-order (full-batch) algorithm and demonstrate that fluctuation parameters in a non-asymptotic domain emerge in addition to the DMF kernels.

Paper Structure (27 sections, 10 theorems, 72 equations, 3 figures, 1 algorithm)


Key Result

Theorem 1

For every $\sigma>0$ and $z\in\mathbb{R}$, the solutions $\xi_\psi$ and $\xi'_\phi$ have identical distributions. In other words, for any measurable function $h:\mathcal{B}\to\mathbb{R}$, we have
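
For reference, equality in distribution of two random elements is conventionally expressed as equality of expectations over test functions. A sketch of that standard form, assuming $h$ is such that both expectations exist (an integrability condition added here, not taken from the theorem statement):

$$\mathbb{E}\left[h(\xi_\psi)\right] = \mathbb{E}\left[h(\xi'_\phi)\right]$$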

Figures (3)

  • Figure 1: Training error for gradient descent ($s=0, t=.2$) with two classes with $\rho(y)=.5, \|\theta_0\|=.1$, and overlaps: $v(y,y)=1$ and varying coupling $v(0,1)$.
  • Figure 2: Training error for momentum gradient descent ($t=.2$) with varying forgetting factor $s$ and two classes with $\rho(y)=.5,\ \gamma=1,\ \|\theta_0\|=.1$, and overlaps: $v(y,y)=1$ and coupling $v(0,1)=-.5$.
  • Figure 3: Normalized variance of training a perceptron with soft ReLU function, and two classes with $\rho(y)=.5,\ \gamma=1,\ \|\theta_0\|=.1$, and overlaps: $v(y,y)=1$ and coupling $v(0,1)=-.5$. The empirical values are calculated by averaging over $10^5$ realizations.
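
The setting in the captions above (two equiprobable Gaussian classes, a small initial weight vector, full-batch gradient descent with optional momentum) can be simulated with a minimal sketch such as the one below. Several choices here are assumptions made for illustration and are not taken from the paper: the overlap $v(y,y')$ is read as the inner product of the class means, the covariance is the identity, the perceptron uses a sigmoid activation with squared loss on 0/1 labels, and $t$ and $s$ are read as the step size and the momentum (forgetting) factor. The helper names `make_means`, `sample_gmm`, and `train_perceptron` are hypothetical.

```python
import numpy as np


def make_means(d, coupling):
    """Two unit-norm class means in R^d whose inner product equals `coupling`.

    Assumption: the overlap v(y, y') is the inner product of the class means.
    """
    means = np.zeros((2, d))
    means[0, 0] = 1.0
    means[1, 0] = coupling
    means[1, 1] = np.sqrt(1.0 - coupling ** 2)
    return means


def sample_gmm(n, d, means, rho, rng):
    """Draw n points from a Gaussian mixture with identity covariance (assumed)."""
    y = rng.choice(len(means), size=n, p=rho)
    x = means[y] + rng.standard_normal((n, d))
    return x, y


def train_perceptron(x, y, theta0, step=0.2, momentum=0.0, iters=50):
    """Full-batch gradient descent with momentum on a sigmoid perceptron.

    Assumption: squared loss on 0/1 labels; the paper's loss may differ.
    Returns the final weights and the training error recorded at each iteration.
    """
    theta = theta0.copy()
    velocity = np.zeros_like(theta)
    errors = []
    n = len(y)
    for _ in range(iters):
        pred = 1.0 / (1.0 + np.exp(-(x @ theta)))
        residual = pred - y
        errors.append(0.5 * np.mean(residual ** 2))
        # Gradient of the mean squared loss through the sigmoid.
        grad = x.T @ (residual * pred * (1.0 - pred)) / n
        velocity = momentum * velocity - step * grad
        theta = theta + velocity
    return theta, np.array(errors)
```

A typical run would build `means = make_means(50, -0.5)`, draw data with `sample_gmm(200, 50, means, [0.5, 0.5], np.random.default_rng(0))`, scale a random initial vector to norm 0.1, and compare the error curves returned by `train_perceptron` for different values of `momentum`.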

Theorems & Definitions (28)

  • Theorem 1
  • Claim 1
  • Theorem 2
  • proof
  • Theorem 3: Extended Gordon Lemma
  • proof
  • Proposition 1
  • Definition 1
  • Theorem 4
  • proof
  • ...and 18 more