The Optimal Condition Number for ReLU Function

Yu Xia, Haoyu Zhou

TL;DR

This work establishes fundamental limits and optimality results for the stability of the ReLU map in single neural network layers. It proves a universal lower bound $β_{A,b}\ge\sqrt{2}$ and shows that Gaussian random weights with zero bias asymptotically attain this bound, implying distance-preserving behavior in wide random networks. A general cone-based framework (Theorem Lipschitz_Result) provides explicit bi-Lipschitz bounds for Gaussian matrices with sample complexity depending on the Gaussian width $ω((S-S)\cap\mathbb{B}^n)$, improving prior ω⁴ dependencies to ω²; the analysis combines large- and small-distance regimes to bound $\frac{1}{m}\|σ(Ax)-σ(Ay)\|_2^2$ in terms of $\|x-y\|_2^2$ and an angular term φ(x,y). The results theoretically justify Gaussian initialization as distance-preserving and connect random-weight propagation to precise geometric and probabilistic tools, offering rigorous foundations for stable signal propagation in deep networks.
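As a worked illustration of where the constant $\sqrt{2}$ comes from in the zero-bias Gaussian case, the expectation of the quantity being bounded can be computed in closed form. This is a standard calculation and a sketch only, not the paper's theorem: it assumes $\boldsymbol{a}\sim\mathcal{N}(0,\boldsymbol{I}_n)$ is one row of $\boldsymbol{A}$ and writes $\theta\in[0,\pi]$ for the angle between $x$ and $y$, which need not coincide with the paper's angular term φ(x,y).

$$
\mathbb{E}\big(\sigma(\langle \boldsymbol{a},x\rangle)-\sigma(\langle \boldsymbol{a},y\rangle)\big)^2
= \tfrac12\|x\|_2^2+\tfrac12\|y\|_2^2-\frac{\|x\|_2\|y\|_2}{\pi}\big(\sin\theta+(\pi-\theta)\cos\theta\big).
$$

For collinear points ($\theta=0$) the right-hand side equals $\tfrac12(\|x\|_2-\|y\|_2)^2=\tfrac12\|x-y\|_2^2$. For antipodal points ($y=-x$, $\theta=\pi$) it equals $\|x\|_2^2=\tfrac14\|x-y\|_2^2$. In expectation, then, the squared distortion ranges between $\tfrac14$ and $\tfrac12$ on these two configurations, and the ratio of the corresponding (unsquared) Lipschitz constants is $\sqrt{(1/2)/(1/4)}=\sqrt{2}$.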

Abstract

ReLU is a widely used activation function in deep neural networks. This paper explores the stability properties of the ReLU map. For any weight matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$ and bias vector $\boldsymbol{b} \in \mathbb{R}^{m}$ at a given layer, we define the condition number $β_{\boldsymbol{A},\boldsymbol{b}}$ as $β_{\boldsymbol{A},\boldsymbol{b}} = \frac{\mathcal{U}_{\boldsymbol{A},\boldsymbol{b}}}{\mathcal{L}_{\boldsymbol{A},\boldsymbol{b}}}$, where $\mathcal{U}_{\boldsymbol{A},\boldsymbol{b}}$ and $\mathcal{L}_{\boldsymbol{A},\boldsymbol{b}}$ are the upper and lower Lipschitz constants, respectively. We first demonstrate that for any given $\boldsymbol{A}$ and $\boldsymbol{b}$, the condition number satisfies $β_{\boldsymbol{A},\boldsymbol{b}} \geq \sqrt{2}$. Moreover, when the weights of the network at a given layer are initialized as random i.i.d. Gaussian variables and the bias term is set to zero, the condition number asymptotically approaches this lower bound. This theoretical finding suggests that Gaussian weight initialization is optimal for preserving distances in the context of random deep neural network weights.
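The asymptotic claim above can be checked numerically with a short NumPy sketch. It is an illustration under stated assumptions, not the authors' experiment. The helper `distortion`, the sizes `m` and `n`, and the two test pairs (collinear and antipodal) are hypothetical choices made here. The two pairs are the configurations where the expected distortion of a zero-bias Gaussian layer is largest and smallest, so their ratio only approximates the condition number. The true $β_{\boldsymbol{A},\boldsymbol{b}}$ of a fixed matrix needs a supremum and infimum over all pairs.

```python
import numpy as np

def relu(z):
    """Elementwise ReLU."""
    return np.maximum(z, 0.0)

def distortion(A, x, y):
    """Return (1/m) * ||relu(Ax) - relu(Ay)||^2 / ||x - y||^2 (zero bias)."""
    m = A.shape[0]
    diff = relu(A @ x) - relu(A @ y)
    return float(diff @ diff) / (m * float((x - y) @ (x - y)))

rng = np.random.default_rng(0)
m, n = 100_000, 8                      # wide layer: m much larger than n (assumed sizes)
A = rng.standard_normal((m, n))        # i.i.d. N(0, 1) weights, no bias
x = rng.standard_normal(n)

# Collinear pair: expected squared distortion is 1/2 (largest case).
ratio_collinear = distortion(A, x, 0.5 * x)
# Antipodal pair: expected squared distortion is 1/4 (smallest case).
ratio_antipodal = distortion(A, x, -x)

# Ratio of the two (unsquared) Lipschitz-type constants on these pairs.
beta_estimate = np.sqrt(ratio_collinear / ratio_antipodal)
print(ratio_collinear, ratio_antipodal, beta_estimate)
```

With a large `m` the two ratios should land near 1/2 and 1/4, and `beta_estimate` near $\sqrt{2}$. Smaller `m` gives noisier values.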

Paper Structure

This paper contains 14 sections, 15 theorems, 172 equations.

Key Result

Theorem 2.1

For any matrix $\boldsymbol{A}\in \mathbb{R}^{m\times n}$ and vector ${\boldsymbol b}\in \mathbb{R}^{m}$, the condition number $\beta_{\boldsymbol{A},{\boldsymbol b}}$, as formulated in (beta_def), satisfies the following inequalities:

Theorems & Definitions (34)

  • Theorem 2.1
  • proof
  • Theorem 2.2
  • proof
  • Definition 3.1: Gaussian width
  • Definition 3.2: $\epsilon$-Net
  • Theorem 3.1
  • proof
  • Remark 3.2
  • proof: Proof of Theorem \ref{thm: Gaussian_bilip}
  • ...and 24 more