Deep Network Approximation: Beyond ReLU to Diverse Activation Functions

Shijun Zhang; Jianfeng Lu; Hongkai Zhao

TL;DR

This work shows that the expressive power of deep neural networks is not tied to ReLU activations. It defines a broad activation-function set $\mathscr{A}$ and proves that any $\mathtt{ReLU}$ network of width $N$ and depth $L$ can be uniformly approximated on bounded domains by a $\varrho$-activated network of width $3N$ and depth $2L$, for any $\varrho\in\mathscr{A}$. For activations in special subsets of $\mathscr{A}$, the (width, depth) requirement sharpens to $((k{+}2)N,\,L)$, $(2N,\,L)$, or even $(N,\,L)$, that is, scaling factors $(1,1)$, so diverse activations can be used without sacrificing approximation strength. The proofs rely on an activation-replacement technique and finite-difference approximations of derivatives, constructed so that convergence is uniform on bounded sets. Together, the results carry ReLU-based approximation theory over to a wide class of activations, broadening the theoretical guarantees available when choosing an activation in practice.
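To make the finite-difference idea concrete, here is a minimal sketch, not the paper's construction. It assumes a smooth activation (Sigmoid is used as an example) and shows that a central difference of two activated neurons, rescaled by the derivative at the expansion point, reproduces the identity map on a bounded interval $[-A, A]$ with an error that shrinks as the step `h` shrinks. The function names `approx_identity` and `max_error` are hypothetical.

```python
import math


def sigmoid(t):
    """Logistic sigmoid; its derivative at 0 is 1/4."""
    return 1.0 / (1.0 + math.exp(-t))


def approx_identity(x, h):
    """Approximate x using two sigmoid neurons.

    Central finite difference around 0:
        (sigmoid(h*x) - sigmoid(-h*x)) / (2*h) ~ sigmoid'(0) * x,
    then divide by sigmoid'(0) = 0.25. The factors h and 1/(2*h*0.25)
    would be absorbed into the affine maps before and after the neurons.
    """
    return (sigmoid(h * x) - sigmoid(-h * x)) / (2.0 * h * 0.25)


def max_error(A, h, samples=2001):
    """Largest error |approx_identity(x) - x| over a uniform grid on [-A, A]."""
    xs = [-A + 2.0 * A * i / (samples - 1) for i in range(samples)]
    return max(abs(approx_identity(x, h) - x) for x in xs)


if __name__ == "__main__":
    for h in (0.1, 0.01, 0.001):
        print(f"h = {h:<6} max error on [-5, 5] = {max_error(5.0, h):.3e}")
```

The error behaves like $h^2 |x|^3 / 12$ to leading order, so for a fixed bound $A$ it can be driven below any $\varepsilon$ by shrinking `h`. This is why statements of this kind hold on bounded sets rather than on all of $\mathbb{R}$.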

Abstract

This paper explores the expressive power of deep neural networks for a diverse range of activation functions. An activation function set $\mathscr{A}$ is defined to encompass the majority of commonly used activation functions, such as $\mathtt{ReLU}$, $\mathtt{LeakyReLU}$, $\mathtt{ReLU}^2$, $\mathtt{ELU}$, $\mathtt{CELU}$, $\mathtt{SELU}$, $\mathtt{Softplus}$, $\mathtt{GELU}$, $\mathtt{SiLU}$, $\mathtt{Swish}$, $\mathtt{Mish}$, $\mathtt{Sigmoid}$, $\mathtt{Tanh}$, $\mathtt{Arctan}$, $\mathtt{Softsign}$, $\mathtt{dSiLU}$, and $\mathtt{SRS}$. We demonstrate that for any activation function $\varrho\in \mathscr{A}$, a $\mathtt{ReLU}$ network of width $N$ and depth $L$ can be approximated to arbitrary precision by a $\varrho$-activated network of width $3N$ and depth $2L$ on any bounded set. This finding enables the extension of most approximation results achieved with $\mathtt{ReLU}$ networks to a wide variety of other activation functions, albeit with slightly increased constants. Significantly, we establish that the (width,$\,$depth) scaling factors can be further reduced from $(3,2)$ to $(1,1)$ if $\varrho$ falls within a specific subset of $\mathscr{A}$. This subset includes activation functions such as $\mathtt{ELU}$, $\mathtt{CELU}$, $\mathtt{SELU}$, $\mathtt{Softplus}$, $\mathtt{GELU}$, $\mathtt{SiLU}$, $\mathtt{Swish}$, and $\mathtt{Mish}$.
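As an illustration of why scaling factors $(1,1)$ are plausible for an activation such as $\mathtt{Softplus}$, the sketch below checks numerically that a single rescaled Softplus neuron, $x \mapsto \mathtt{Softplus}(Kx)/K$, approaches $\mathtt{ReLU}(x)$ uniformly as $K$ grows. It relies on the elementary identity $\mathtt{Softplus}(t)-\mathtt{ReLU}(t)=\ln(1+e^{-|t|})\le \ln 2$. It is an assumption-laden illustration, not the paper's proof, and the names `scaled_softplus` and `sup_error` are hypothetical.

```python
import math


def relu(x):
    return max(x, 0.0)


def softplus(t):
    """Numerically stable softplus: log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def scaled_softplus(x, K):
    """One Softplus neuron with input scaled by K and output scaled by 1/K.

    In a network, both scalings can be folded into the neighbouring
    affine maps, so width and depth are unchanged.
    """
    return softplus(K * x) / K


def sup_error(A, K, samples=2001):
    """Largest |scaled_softplus(x, K) - relu(x)| on a uniform grid over [-A, A]."""
    xs = [-A + 2.0 * A * i / (samples - 1) for i in range(samples)]
    return max(abs(scaled_softplus(x, K) - relu(x)) for x in xs)


if __name__ == "__main__":
    for K in (1.0, 10.0, 100.0, 1000.0):
        bound = math.log(2.0) / K
        print(f"K = {K:<7} sup error = {sup_error(5.0, K):.3e}  (ln 2 / K = {bound:.3e})")
```

The worst case occurs at $x = 0$, where the error equals $\ln 2 / K$, so any target precision $\varepsilon$ is met by taking $K \ge \ln 2/\varepsilon$.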

Paper Structure

This paper contains 22 sections, 17 theorems, 50 equations, 2 figures, and 2 tables.

Introduction
Definition of Activation Function Set
Main Results
Further Discussions
Additional Results
Related Work
Definitions and Illustrations of Common Activation Functions
Proofs of Theorems in Sections \ref{sec:intro} and \ref{sec:further:discussion}
Notations
Propositions for Proving Theorems in Sections \ref{sec:intro} and \ref{sec:further:discussion}
Proof of Theorem \ref{thm:main} Based on Propositions
Proofs of Theorems in Section \ref{sec:additional:theorems} Based on Propositions
Proof of Proposition \ref{prop:activation:replace}
Proof of Proposition \ref{prop:approx:f:nth:D}
A Lemma for Proving Proposition \ref{prop:approx:f:nth:D}
...and 7 more sections

Key Result

Theorem 1

Suppose $\varrho\in\mathscr{A}$ and $\boldsymbol{\phi}_{\mathtt{ReLU}}\in\mathcal{N}\mathcal{N}_{\mathtt{ReLU}}\{N,\,L;\ \mathbb{R}^{d}\to\mathbb{R}^{n}\}$ with $N,L,d,n\in\mathbb{N}^+$. Then for any $\varepsilon>0$ and $A>0$, there exists $\boldsymbol{\phi}_\varrho\in\mathcal{N}\mathcal{N}$...

Figures (2)

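Figure 1: Illustrations of how a single active neuron activated by $\varrho\in\widetilde{\mathscr{A}}_2$ is adequate for approximating the activation function.
Figure 2: Illustrations of $\mathtt{ReLU}$, $\mathtt{LeakyReLU}$, $\mathtt{ReLU}^2$, $\mathtt{ELU}$, $\mathtt{CELU}$, $\mathtt{SELU}$, $\mathtt{Softplus}$, $\mathtt{GELU}$, $\mathtt{SiLU}$, $\mathtt{Swish}$, $\mathtt{Mish}$, $\mathtt{Sigmoid}$, $\mathtt{Tanh}$, $\mathtt{Arctan}$, $\mathtt{Softsign}$, $\mathtt{dSiLU}$, and $\mathtt{SRS}$.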

Theorems & Definitions (17)

Theorem 1
Corollary 2
Corollary 3
Corollary 4
Corollary 5
Theorem 6
Theorem 7
Theorem 8
Theorem 9
Proposition 10
...and 7 more
