Deep Equilibrium Models are Almost Equivalent to Not-so-deep Explicit Models for High-dimensional Gaussian Mixtures

Zenan Ling; Longbo Li; Zhanbo Feng; Yixuan Zhang; Feng Zhou; Robert C. Qiu; Zhenyu Liao


TL;DR

This work uses random matrix theory to analyze the spectral behavior of the conjugate kernel (CK) and neural tangent kernel (NTK) matrices of Deep Equilibrium Models (DEQs) when the inputs are drawn from a high-dimensional Gaussian mixture. It shows that these kernels depend on the activation function and the initial weight variances only through a system of four nonlinear equations, which gives a principled route to constructing equivalent shallow explicit networks via CK/NTK matching. The authors provide a practical recipe for designing activations of explicit networks that replicate the DEQ spectral properties, validated on Gaussian mixtures and on real datasets such as MNIST, Fashion-MNIST, and CIFAR-10. Overall, the paper establishes a high-dimensional equivalence between implicit DEQs and shallow explicit networks, which suggests potential computational savings and offers guidance for network design.

Abstract

Deep equilibrium models (DEQs), as a typical implicit neural network, have demonstrated remarkable success on various tasks. There is, however, a lack of theoretical understanding of the connections and differences between implicit DEQs and explicit neural network models. In this paper, leveraging recent advances in random matrix theory (RMT), we perform an in-depth analysis of the eigenspectra of the conjugate kernel (CK) and neural tangent kernel (NTK) matrices for implicit DEQs, when the input data are drawn from a high-dimensional Gaussian mixture. We prove, in this setting, that the spectral behavior of these Implicit-CKs and NTKs depends on the DEQ activation function and initial weight variances, but only via a system of four nonlinear equations. As a direct consequence of this theoretical result, we demonstrate that a shallow explicit network can be carefully designed to produce the same CK or NTK as a given DEQ. Although derived here for Gaussian mixture data, empirical results show that the proposed theory and design principle also apply to popular real-world datasets.


Paper Structure (40 sections, 7 theorems, 97 equations, 8 figures, 2 tables)


Introduction
Our Contributions
Related Works
Neural tangent kernels.
Over-parameterized DEQs.
Random matrix theory and NNs.
Preliminaries
Notations.
Main Results
High-dimensional Characterization of Implicit-CK and NTK Matrices
High-dimensional Equivalence between DEQs and Shallow Explicit Networks
A brief review of Explicit CKs and NTKs
Designing Equivalent Explicit NNs via CK matching
Experiments
High-dimensional approximations of Implicit-CKs and NTKs.
...and 25 more sections

Key Result

Proposition 2.6

Under Assumptions assum:initial-cond:G*, the Implicit-CK of the DEQ model in Definition 2.1 takes the following form: where the $(i,j)$-th entry of ${\bm{G}}^{(l)}$ is defined recursively as ${\bm{G}}^{(l)}_{ij} = \mathbb{E}[({\bm{z}}_i^{(l)})^\top{\bm{z}}_j^{(l)}]$ (note that the expectation is conditioned on the input data, and is taken with respect to the random weights), i.e., ${\bm{G}}^{(0)}_{ij} = ({
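
To make the expectation in Proposition 2.6 concrete, the following is a minimal Monte Carlo sketch of ${\bm{G}}^{(l)}_{ij} = \mathbb{E}[({\bm{z}}_i^{(l)})^\top{\bm{z}}_j^{(l)}]$: the input data are held fixed and the average is taken over independent draws of the random weights. The update rule $z \leftarrow \phi(\sigma_a A z + \sigma_b B x)/\sqrt{m}$, the initialization $z^{(0)} = 0$, the standard Gaussian weights, and all function and parameter names are assumptions made for illustration; Definition 2.1 is not reproduced on this page, so this may differ from the paper's exact model.

```python
import numpy as np


def deq_features(X, m, sigma_a, sigma_b, phi, n_iter, rng):
    """Iterate an ASSUMED DEQ update z <- phi(sigma_a*A z + sigma_b*B x)/sqrt(m).

    Rows of X are inputs x_i; rows of the returned Z are the states z_i.
    """
    n, p = X.shape
    A = rng.standard_normal((m, m))   # assumed i.i.d. N(0, 1) weights
    B = rng.standard_normal((m, p))
    Z = np.zeros((n, m))              # assumed initialization z^(0) = 0
    drive = sigma_b * X @ B.T         # input injection, fixed across iterations
    for _ in range(n_iter):
        Z = phi(sigma_a * Z @ A.T + drive) / np.sqrt(m)
    return Z


def monte_carlo_ck(X, m, sigma_a, sigma_b, phi=np.tanh,
                   n_iter=30, n_draws=10, seed=0):
    """Average Z Z^T over independent weight draws, with the data X held fixed."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    G = np.zeros((n, n))
    for _ in range(n_draws):
        Z = deq_features(X, m, sigma_a, sigma_b, phi, n_iter, rng)
        G += Z @ Z.T
    return G / n_draws
```

With a 1-Lipschitz activation such as tanh and $\sigma_a^2 = 0.2$ (the value quoted in the figure captions), the assumed update is typically a contraction for large $m$, so a few dozen iterations are usually enough for the states to settle; this is a heuristic remark about the sketch, not a statement of the paper's assumptions.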

Figures (8)

Figure 1: Evolution of relative spectral norm error $\|{\bm{G}}^*-\overline{{\bm{G}}}\|/\|{\bm{G}}^*\|$ w.r.t. sample size $n$, for DEQs in Definition 2.1 with different activations and $\sigma_a^2=0.2$, on two-class GMM, $p/n = 0.8$, ${\bm{\mu}}_a=[\mathbf{0}_{8(a-1)};8;\mathbf{0}_{p-8a+7}]$, and ${\bm{C}}_a=(1+8(a-1)/\sqrt{p}){\bm{I}}_p, a\in \{1,2\}$.
Figure 2: Left: Visualization of activations of DEQs (dashed) and those of equivalent explicit NNs (solid). Right: Evolution of relative spectral norm errors $\|{\bm{G}}_{\text{Tanh}}^*-{\boldsymbol{\Sigma}}_{\text{H-Tanh}}^{(1)}\|/ \|{\bm{G}}_{\text{Tanh}}^*\|$ and $\|{\bm{G}}_{\text{ReLU}}^*-{\boldsymbol{\Sigma}}_{\text{L-ReLU}}^{(2)}\|/ \|{\bm{G}}_{\text{ReLU}}^*\|$ w.r.t. sample size $n$ on GMM as in Figure 1 for Example \ref{exm:Tanh} (red) and Example \ref{exm:ReLU} (blue), respectively.
Figure 3: Classification accuracies of implicit DEQs and explicit models trained with SGD. Top: Evolution of classification accuracies w.r.t. the width $m$ of Tanh-DEQ (green), the corresponding equivalent explicit H-Tanh-ENN (blue), and Tanh-ENN (red). Bottom: Evolution of classification accuracies w.r.t. the width $m$ of ReLU-DEQ (green), the corresponding equivalent explicit L-ReLU-ENN (blue), and ReLU-ENN (red). For the MNIST (left) and Fashion-MNIST (middle) datasets, raw data are taken as the network input; for the CIFAR-10 dataset (right), the flattened output of the 16th convolutional layer of VGG-19 is used.
Figure 3: Evolution of relative spectral norm error $\|{\bm{K}}^*-\overline{{\bm{K}}}\| / \|{\bm{K}}^*\|$ w.r.t. sample size $n$, for DEQs in Definition 2.1 with different activations and $\sigma_a^2=0.2$, on two-class GMM, $p/n = 0.8$, ${\bm{\mu}}_a=[\mathbf{0}_{8(a-1)};8;\mathbf{0}_{p-8a+7}]$, and ${\bm{C}}_a=(1+8(a-1)/\sqrt{p}){\bm{I}}_p, a\in \{1,2\}$. Implicit-NTK matrices ${\bm{K}}^*$ defined in Eq. (\ref{eq:imntk}) are taken with expectation estimated from DEQs with random ${\bm{A}}$ and ${\bm{B}}$ of width $m=2^{12}$. The asymptotic equivalent matrices $\overline{{\bm{K}}}$ are obtained by Theorem 3.4.
Figure 3: Eigenvalue density of Implicit-CK matrices (blue) ${\bm{G}}_{\text{ReLU}}^{*}$ of ReLU-DEQ (top) and ${\bm{G}}_{\text{Tanh}}^{*}$ of Tanh-DEQ (bottom) and the corresponding high-dimensional approximations $\overline{{\bm{G}}}_{\text{ReLU}}$ and $\overline{{\bm{G}}}_{\text{Tanh}}$ (red), on two-class GMM data (left) with $p=1\,000$, $n=800$, ${\bm{\mu}}_a=[\mathbf{0}_{8(a-1)};8;\mathbf{0}_{p-8a+7}]$, ${\bm{C}}_a=(1+8(a-1)/\sqrt{p}){\bm{I}}_p$, for $a\in \{1,2\}$; here $\|{\bm{G}}_{\text{ReLU}}^*-\overline{{\bm{G}}}_{\text{ReLU}}\|\approx 0.26$ and $\|{\bm{G}}_{\text{Tanh}}^*-\overline{{\bm{G}}}_{\text{Tanh}}\|\approx 0.81$; and on two-class MNIST data (right) (number $6$ versus number $8$), with $p=784$, $n=3\,000$, for which $\|{\bm{G}}_{\text{ReLU}}^*-\overline{{\bm{G}}}_{\text{ReLU}}\|\approx 1.80$ and $\|{\bm{G}}_{\text{Tanh}}^*-\overline{{\bm{G}}}_{\text{Tanh}}\|\approx 2.02$. For the MNIST case, small eigenvalues close to zero are removed for better visualization.
...and 3 more figures
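
The two-class GMM described in the captions, and the relative spectral norm error they report, can be sketched as follows. The class means ${\bm{\mu}}_a$ and covariances ${\bm{C}}_a$ follow the caption formulas; the equal split of samples between the two classes, the absence of any additional data normalization, and all function names are assumptions made for illustration.

```python
import numpy as np


def gmm_means(p):
    """Class means mu_a = [0_{8(a-1)}; 8; 0_{p-8a+7}] for a in {1, 2}."""
    means = []
    for a in (1, 2):
        mu = np.zeros(p)
        mu[8 * (a - 1)] = 8.0  # single nonzero entry after 8(a-1) zeros
        means.append(mu)
    return means


def sample_two_class_gmm(n, p, rng):
    """Draw n // 2 points per class (assumed split) with C_a = (1 + 8(a-1)/sqrt(p)) I_p."""
    means = gmm_means(p)
    X, y = [], []
    for a in (1, 2):
        std = np.sqrt(1.0 + 8.0 * (a - 1) / np.sqrt(p))  # std dev of the isotropic C_a
        n_a = n // 2
        X.append(means[a - 1] + std * rng.standard_normal((n_a, p)))
        y.append(np.full(n_a, a))
    return np.vstack(X), np.concatenate(y)


def relative_spectral_error(G_ref, G_approx):
    """||G_ref - G_approx|| / ||G_ref|| with the spectral (operator 2-) norm."""
    return np.linalg.norm(G_ref - G_approx, 2) / np.linalg.norm(G_ref, 2)
```

For instance, `sample_two_class_gmm(500, 400, np.random.default_rng(0))` gives a data set with $p/n = 0.8$, the ratio used in the error-versus-$n$ plots; `relative_spectral_error` can then compare any estimated kernel matrix against a candidate approximation of the same shape.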

Theorems & Definitions (17)

Definition 2.1: Deep equilibrium model, DEQ
Remark 2.5: On CKs and NTKs
Proposition 2.6: Implicit-CKs and NTKs of DEQ (feng2020neural; gao2023wide)
Remark 2.8: On GMM in \ref{def:gmm}
Remark 3.2: Existence and uniqueness of $\tau_*$
Theorem 3.3: High-dimensional approximation of Implicit-CKs
Theorem 3.4: High-dimensional approximation of Implicit-NTKs
Remark 3.5: On centered activation
Definition 3.6: Fully-connected explicit NN model
Proposition 3.7: Explicit-CKs and NTKs (jacot2018neural; fan2020spectra)
...and 7 more
