Global Convergence Rate of Deep Equilibrium Models with General Activations

Lan V. Truong

TL;DR

The paper studies over-parameterized Deep Equilibrium Models (DEQs) with general activations and proves that gradient descent converges linearly to the global minimum of the quadratic loss $\Phi(\theta) = \frac{1}{2}\|\hat{y}(\theta) - y\|^2$, provided the width $m$ is large enough and the initial Gram eigenvalue $\lambda_0$ is bounded below via a population Gram matrix $\mathbf{K}$ with strictly positive least eigenvalue $\lambda_*$. A novel population Gram matrix is constructed using a new dual activation $\tilde{Q}_{\alpha,\beta}$ and Hermite polynomial expansion to ensure $\mathbf{K}$ is symmetric positive definite; concentration arguments then relate $\mathbf{G}$ to $\mathbf{K}$, yielding a lower bound $\lambda_0 \geq (m/2)\lambda_*$ with $m=\tilde{O}(n^2/\lambda_*^2)$, and a linear convergence rate, namely $\Phi(\theta(\tau)) \leq (1-\eta\frac{\lambda_0}{2})^{\tau}\Phi(\theta(0))$, where $\eta$ is the learning rate. The WIAL algorithm provides a practical initialization scheme, and experiments on MNIST/CIFAR-10 corroborate linear convergence across several activations, including non-homogeneous ones. Overall, the work broadens the applicability of DEQs by establishing provable linear convergence for a broad class of activation functions and offering a concrete initialization procedure.
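
To see what the linear rate means in practice, the bound can be turned into an iteration count. The short derivation below is a standard consequence of the stated inequality, not a result quoted from the paper. It writes $\eta$ for the learning rate, assumes $0 < \eta\lambda_0/2 \le 1$, and introduces a target accuracy $\varepsilon$ purely for illustration; the only tool used is $1 - x \le e^{-x}$.

```latex
\Phi(\theta(\tau))
  \le \Bigl(1 - \tfrac{\eta\lambda_0}{2}\Bigr)^{\tau} \Phi(\theta(0))
  \le \exp\!\Bigl(-\tfrac{\eta\lambda_0 \tau}{2}\Bigr) \Phi(\theta(0)),
\qquad\text{so}\qquad
\tau \ge \frac{2}{\eta\lambda_0}\,\log\frac{\Phi(\theta(0))}{\varepsilon}
\;\Longrightarrow\;
\Phi(\theta(\tau)) \le \varepsilon .
```

Under these assumptions the number of gradient steps grows only logarithmically in $1/\varepsilon$, and a larger $\lambda_0$ shortens it proportionally.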

Abstract

In a recent paper, Ling et al. investigated the over-parametrized Deep Equilibrium Model (DEQ) with ReLU activation. They proved that gradient descent converges to a globally optimal solution at a linear rate for the quadratic loss function. This paper shows that the same result holds for DEQs with any general activation that has bounded first and second derivatives. Since such an activation function is generally non-homogeneous, bounding the least eigenvalue of the Gram matrix of the equilibrium point is particularly challenging. To accomplish this task, we create a novel population Gram matrix and develop a new form of dual activation with Hermite polynomial expansion.
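
As background for the Hermite-expansion idea, the sketch below computes the classical dual activation $\hat{\sigma}(\rho) = \sum_k a_k^2 \rho^k$, where $a_k = \mathbb{E}[\sigma(Z) h_k(Z)]$ for $Z \sim \mathcal{N}(0,1)$ and $h_k$ is the normalized probabilists' Hermite polynomial. This is the textbook construction, not the new dual activation developed in the paper. The function names, the truncation at 30 terms, the 100 quadrature nodes, and the choice of tanh are all illustrative assumptions.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

def hermite_coefficients(sigma, num_terms=30, num_nodes=100):
    """Return a_k = E[sigma(Z) h_k(Z)] for Z ~ N(0, 1), with h_k = He_k / sqrt(k!)."""
    nodes, weights = hermegauss(num_nodes)          # quadrature for weight exp(-x^2 / 2)
    weights = weights / math.sqrt(2.0 * math.pi)    # normalise to the N(0, 1) density
    coeffs = np.empty(num_terms)
    for k in range(num_terms):
        basis = np.zeros(k + 1)
        basis[k] = 1.0                              # coefficient vector selecting He_k
        h_k = hermeval(nodes, basis) / math.sqrt(math.factorial(k))
        coeffs[k] = np.sum(weights * sigma(nodes) * h_k)
    return coeffs

def dual_activation(coeffs, rho):
    """Truncated classical dual activation: sum_k a_k^2 * rho^k."""
    powers = rho ** np.arange(len(coeffs))
    return float(np.sum(coeffs ** 2 * powers))

# Illustrative use with tanh (an assumed example activation).
a = hermite_coefficients(np.tanh)
dual_at_half = dual_activation(a, 0.5)
```

Because every coefficient of the power series is a square, the truncated dual is non-negative and non-decreasing on $[0, 1]$. Positivity arguments for Gram matrices built from such expansions typically rely on this property.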

Paper Structure (18 sections, 18 theorems, 224 equations, 2 figures)


Introduction
Problem setting
Main results
A novel design of the population Gram matrix $\mathbf{K}$
Proof of Theorem \ref{thm13}
Checking the conditions of Theorem \ref{theom1}
Weight Initialisation Algorithm
Numerical Results
Conclusion
Proof of Lemma \ref{lemesy}
Proof of Lemma \ref{lemau2}
Proof of Proposition \ref{pro:G}
Proof of Proposition \ref{prop9}
Proof of Proposition \ref{prop12}
Proof of Theorem \ref{theom1}
...and 3 more sections

Key Result

Theorem 3

Consider a DEQ. Let $\delta$ be a constant such that $\|\mathbf{W}(0)\|_2+\delta<1/L$. Denote by $\bar{\rho}_w=\|\mathbf{W}(0)\|_2+\delta, \bar{\rho}_u=\|\mathbf{U}(0)\|_2+\delta, \bar{\rho}_a=\|\mathbf{a}(0)\|_2+\delta$ and define In addition, assume at initialization that where $\lambda_0$ is the least eigenvalue of $\mathbf{G}(0)=\mathbf{T}(0)^T \mathbf{T}(0)$. Then, if the learning rate sati
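
The quantities named in the theorem can be checked numerically on a toy instance. The sketch below assumes the common DEQ form in which the equilibrium solves $\mathbf{T} = \sigma(\mathbf{W}\mathbf{T} + \mathbf{U}\mathbf{X})$, with one column per sample; the excerpt above does not spell this out, so treat it as an assumption. The sizes, the tanh activation (with $L = 1$), the value of $\delta$, and the helper `equilibrium_features` are all hypothetical choices for illustration. It checks the contraction margin $\|\mathbf{W}(0)\|_2 + \delta < 1/L$ and computes $\lambda_0$ as the least eigenvalue of $\mathbf{G}(0) = \mathbf{T}(0)^T \mathbf{T}(0)$.

```python
import numpy as np

def equilibrium_features(W, U, X, sigma=np.tanh, num_iters=200):
    """Iterate T <- sigma(W T + U X); each column of T is one sample's equilibrium."""
    T = np.zeros((W.shape[0], X.shape[1]))
    for _ in range(num_iters):
        T = sigma(W @ T + U @ X)
    return T

rng = np.random.default_rng(0)
m, d, n = 64, 5, 8            # width, input dimension, sample count (illustrative)
L, delta = 1.0, 0.1           # tanh is 1-Lipschitz; delta is an arbitrary margin

W0 = rng.standard_normal((m, m))
W0 *= 0.5 / np.linalg.norm(W0, 2)       # rescale so that ||W(0)||_2 = 0.5
U0 = rng.standard_normal((m, d)) / np.sqrt(d)
X = rng.standard_normal((d, n))

# Contraction margin from the theorem statement: ||W(0)||_2 + delta < 1 / L.
rho_w = np.linalg.norm(W0, 2) + delta
contraction_ok = rho_w < 1.0 / L

# Fixed-point iteration converges geometrically because L * ||W(0)||_2 < 1.
T0 = equilibrium_features(W0, U0, X)
residual = np.linalg.norm(T0 - np.tanh(W0 @ T0 + U0 @ X))

# Gram matrix of the equilibrium features and its least eigenvalue.
G0 = T0.T @ T0
lambda_0 = np.linalg.eigvalsh(G0)[0]    # eigvalsh returns eigenvalues in ascending order
```

Plain fixed-point iteration is used here only because it is the simplest solver that the contraction condition justifies; practical DEQ implementations often use faster root-finding methods.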

Figures (2)

Figure 1: Training dynamics at different values of $m$.
Figure 2: Training dynamics for different activation functions.

Theorems & Definitions (28)

Definition 1
Definition 2
Theorem 3
Lemma 4
Definition 5
Definition 6
Theorem 7
Theorem 8
Lemma 9
Lemma 10
...and 18 more
