Emergence of Globally Attracting Fixed Points in Deep Neural Networks With Nonlinear Activations

Amir Joudaki, Thomas Hofmann

TL;DR

A theoretical framework for the evolution of the kernel sequence, which measures the similarity between the hidden representations of two different inputs. The analysis shows that, for nonlinear activations, the kernel sequence converges globally to a unique fixed point, which can correspond to orthogonal or similar representations depending on the activation and network architecture.

Abstract

Understanding how neural networks transform input data across layers is fundamental to unraveling their learning and generalization capabilities. Although prior work has used insights from kernel methods to study neural networks, a global analysis of how the similarity between hidden representations evolves across layers remains underexplored. In this paper, we introduce a theoretical framework for the evolution of the kernel sequence, which measures the similarity between the hidden representations of two different inputs. Operating under the mean-field regime, we show that the kernel sequence evolves deterministically via a kernel map, which depends only on the activation function. By expanding the activation in Hermite polynomials and using their algebraic properties, we derive an explicit form for the kernel map and fully characterize its fixed points. Our analysis reveals that for nonlinear activations, the kernel sequence converges globally to a unique fixed point, which can correspond to orthogonal or similar representations depending on the activation and network architecture. We further extend our results to networks with residual connections and normalization layers, demonstrating similar convergence behaviors. This work provides new insights into the implicit biases of deep neural networks and how architectural choices influence the evolution of representations across layers.
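
The Hermite-expansion step described in the abstract can be illustrated numerically. The sketch below is not the paper's code. It assumes the standard series form of the kernel map, $\kappa(\rho) = \sum_k c_k^2 \rho^k$, where $c_k$ are the coefficients of $\phi$ in the normalized probabilists' Hermite basis (a consequence of Mehler's formula). The function names `unit_second_moment`, `hermite_coeffs`, `kernel_map_series`, and `iterate`, the truncation at 30 terms, and the 100-point quadrature are all illustrative choices.

```python
import numpy as np
from math import factorial, sqrt
from numpy.polynomial.hermite_e import hermegauss, hermeval


def unit_second_moment(phi, quad_points=100):
    """Rescale phi so that E[phi(X)^2] = 1 for X ~ N(0, 1)."""
    x, w = hermegauss(quad_points)   # Gauss-Hermite nodes for weight exp(-x^2/2)
    w = w / w.sum()                  # normalize weights to a probability measure
    scale = np.sqrt(np.sum(w * phi(x) ** 2))
    return lambda t: phi(t) / scale


def hermite_coeffs(phi, num_terms=30, quad_points=100):
    """Coefficients c_k = E[phi(X) he_k(X)] in the normalized Hermite basis."""
    x, w = hermegauss(quad_points)
    w = w / w.sum()
    fx = phi(x)
    coeffs = np.empty(num_terms)
    for k in range(num_terms):
        unit = np.zeros(k + 1)
        unit[k] = 1.0
        # He_k / sqrt(k!) has unit variance under N(0, 1)
        he_k = hermeval(x, unit) / sqrt(factorial(k))
        coeffs[k] = np.sum(w * fx * he_k)
    return coeffs


def kernel_map_series(coeffs, rho):
    """Truncated series kappa(rho) = sum_k c_k^2 rho^k (assumed form)."""
    powers = rho ** np.arange(len(coeffs))
    return float(np.sum(coeffs ** 2 * powers))


def iterate(coeffs, rho0, depth):
    """Kernel sequence rho_0, kappa(rho_0), kappa(kappa(rho_0)), ..."""
    rhos = [rho0]
    for _ in range(depth):
        rhos.append(kernel_map_series(coeffs, rhos[-1]))
    return rhos


if __name__ == "__main__":
    phi = unit_second_moment(np.tanh)
    c = hermite_coeffs(phi)
    print("sum of c_k^2:", np.sum(c ** 2))
    print("kernel after 300 layers from rho_0 = 0.9:", iterate(c, 0.9, 300)[-1])
```

For an odd activation such as tanh the constant coefficient $c_0$ vanishes, so in this sketch the iterates started at $\rho_0 = 0.9$ drift toward $\rho = 0$, the orthogonal-representation case mentioned in the abstract.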

Paper Structure

This paper contains 21 sections, 9 theorems, 40 equations, 3 figures, 1 table.

Key Result

Proposition 1

In the mean-field regime with $d \to \infty$, let $\rho_\ell$ denote the kernel sequence of an MLP with activation function $\phi$ obeying $\mathbb{E}\,\phi(X)^2=1$ for $X\sim \mathcal{N}(0,1)$. If each element of the weights is drawn i.i.d. from a distribution with zero mean and unit variance, the kernel sequence evolves deterministically according to $\rho_{\ell+1} = \kappa(\rho_\ell)$, where the initial value $\rho_0$ corresponds to the input, and $\kappa$ is defined in Definition de
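
A finite-width check of this proposition can be sketched as follows. This is an illustration under stated assumptions, not the paper's experimental code. It assumes the kernel map is $\kappa(\rho) = \mathbb{E}[\phi(X)\phi(Y)]$ for standard Gaussians $X, Y$ with correlation $\rho$, that layers compute $h_{\ell+1} = \phi(W h_\ell / \sqrt{d})$ with Gaussian weights, and that the kernel is $\rho_\ell = \langle h_\ell(x), h_\ell(y) \rangle / d$. The helper names, the width of 2000, and the quadrature size are hypothetical choices.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def gauss_nodes(quad_points=80):
    """Quadrature nodes and weights for expectations under N(0, 1)."""
    x, w = hermegauss(quad_points)
    return x, w / w.sum()


def normalize_activation(phi):
    """Rescale phi so that E[phi(X)^2] = 1 for X ~ N(0, 1)."""
    x, w = gauss_nodes()
    scale = np.sqrt(np.sum(w * phi(x) ** 2))
    return lambda t: phi(t) / scale


def kernel_map(phi, rho):
    """kappa(rho) = E[phi(X) phi(Y)], Corr(X, Y) = rho, via 2D quadrature."""
    x, w = gauss_nodes()
    X, Z = np.meshgrid(x, x, indexing="ij")
    Y = rho * X + np.sqrt(max(1.0 - rho ** 2, 0.0)) * Z
    return float(np.sum(np.outer(w, w) * phi(X) * phi(Y)))


def theory_sequence(phi, rho0, depth):
    """Deterministic recursion rho_{l+1} = kappa(rho_l)."""
    rhos = [rho0]
    for _ in range(depth):
        rhos.append(kernel_map(phi, rhos[-1]))
    return rhos


def empirical_sequence(phi, rho0, depth, width=2000, seed=0):
    """Kernel sequence of one random MLP applied to two inputs with kernel rho0."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    z = rng.standard_normal(width)
    z -= (z @ x) / (x @ x) * x                 # make z orthogonal to x
    x *= np.sqrt(width) / np.linalg.norm(x)    # squared norm equal to width
    z *= np.sqrt(width) / np.linalg.norm(z)
    y = rho0 * x + np.sqrt(1.0 - rho0 ** 2) * z
    hx, hy = x, y
    rhos = [float(hx @ hy) / width]
    for _ in range(depth):
        W = rng.standard_normal((width, width))  # zero mean, unit variance
        hx = phi(W @ hx / np.sqrt(width))        # same weights for both inputs
        hy = phi(W @ hy / np.sqrt(width))
        rhos.append(float(hx @ hy) / width)
    return rhos


if __name__ == "__main__":
    phi = normalize_activation(np.tanh)
    for t, e in zip(theory_sequence(phi, 0.5, 5), empirical_sequence(phi, 0.5, 5)):
        print(f"theory {t:.3f}   empirical {e:.3f}")
```

At finite width the empirical kernel fluctuates around the recursion by roughly $1/\sqrt{d}$ per layer, so agreement should tighten as the width grows, consistent with the $d \to \infty$ statement.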

Figures (3)

  • Figure S.1: Validation of Theorem \ref{thm:global_attract}. Each row corresponds to an activation, scaled down by a factor $C$ to obey $\mathbb{E}\, \phi(X)^2=1$. From top to bottom: relu, exp, gelu, tanh. From left, the first column shows the activation, the second column shows the kernel map, the third column shows the kernel sequence vs. depth along with the theory prediction, and the fourth column shows the distance to the fixed points for the theoretical and empirical kernels. (Remainder on the next page)
  • Figure S.2: Continuation of Figure \ref{fig:validation_plots}, for activations elu, celu, sigmoid, and selu. For the particular case of sigmoid, the errors fall below numerical precision and cannot be computed.
  • Figure S.3: Continuation of Figure \ref{fig:validation_plots}, for LeakyReLU with various negative slopes.

Theorems & Definitions (19)

  • Definition 1
  • Proposition 1
  • Definition 2
  • Definition 3
  • Lemma 1: Mehler's lemma
  • Corollary 1
  • Theorem 1
  • Corollary 2
  • Proposition 2
  • Proposition 3
  • ...and 9 more