Fixed width treelike neural networks capacity analysis -- generic activations

Mihailo Stojnic

TL;DR

This work studies the memory capacity of 1-hidden-layer treelike neural networks with activations beyond the typical sign function. It applies Random Duality Theory (RDT) and partially lifted RDT (pl RDT) to derive capacity upper bounds for linear, quadratic, and ReLU hidden activations, and compares them against replica theory predictions. A key finding is that, for quadratic and ReLU activations, the bounding capacity decreases as the hidden width $d$ grows and converges to a constant, attaining its maximum at $d=2$; the large-$d$ limits agree with statistical physics predictions. The findings suggest that increasing hidden width does not always enhance memory capacity, and they open pathways for applying refined RDT methods to broader activation functions and deeper networks.

Abstract

We consider the capacity of \emph{treelike committee machines} (TCM) neural networks. Relying on Random Duality Theory (RDT), \cite{Stojnictcmspnncaprdt23} recently introduced a generic framework for their capacity analysis. An upgrade based on the so-called \emph{partially lifted} RDT (pl RDT) was then presented in \cite{Stojnictcmspnncapliftedrdt23}. Both lines of work focused on networks with the most typical, \emph{sign}, activations. Here, on the other hand, we focus on networks with other, more general, types of activations and show that the frameworks of \cite{Stojnictcmspnncaprdt23,Stojnictcmspnncapliftedrdt23} are sufficiently powerful to handle such scenarios as well. In addition to the standard \emph{linear} activations, we uncover that particularly convenient results can be obtained for two very commonly used activations, namely, the \emph{quadratic} and \emph{rectified linear unit (ReLU)} ones. In more concrete terms, for each of these activations, we obtain both the RDT and pl RDT based upper-bound characterizations of the memory capacity for \emph{any} given (even) number of hidden layer neurons, $d$. In the process, we also uncover the following two, rather remarkable, facts: 1) contrary to the common wisdom, both sets of results show that the bounding capacity decreases for large $d$ (the width of the hidden layer) while converging to a constant value; and 2) the maximum bounding capacity is achieved for the networks with precisely \textbf{\emph{two}} hidden layer neurons! Moreover, the large $d$ converging values are observed to be in excellent agreement with the statistical physics replica theory based predictions.
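
To make the setting concrete, the following is a minimal sketch of a 1-hidden-layer treelike network with the three hidden activations named above, together with a check of whether a given weight choice memorizes a data set. It is an illustration under stated assumptions, not the construction from the paper: the function names `tcm_output` and `memorizes` are hypothetical, the input is assumed to be split into $d$ disjoint blocks of equal size (one per hidden neuron), and the fixed output layer with half $+1$ and half $-1$ weights is an assumption suggested by the restriction to even $d$.

```python
import numpy as np

# Hidden layer activations discussed in the abstract.
ACTIVATIONS = {
    "linear": lambda z: z,
    "quadratic": lambda z: z ** 2,
    "relu": lambda z: np.maximum(z, 0.0),
}


def tcm_output(X, W, activation="relu"):
    """Forward pass of a treelike 1-hidden-layer network (illustrative sketch).

    X: (m, n) array of m data vectors of dimension n.
    W: (d, n // d) array; hidden neuron i only sees the i-th input block.
    """
    m, n = X.shape
    d, delta = W.shape
    assert n == d * delta, "input dimension must split into d equal blocks"
    assert d % 2 == 0, "this sketch assumes an even number of hidden neurons"

    blocks = X.reshape(m, d, delta)            # disjoint input blocks (treelike)
    pre = np.einsum("mdk,dk->md", blocks, W)   # pre-activations, shape (m, d)
    hidden = ACTIVATIONS[activation](pre)

    # ASSUMPTION: fixed readout, half +1 and half -1 (not taken from the source).
    readout = np.concatenate([np.ones(d // 2), -np.ones(d // 2)])
    return np.sign(hidden @ readout)


def memorizes(X, y, W, activation="relu"):
    """True if the network with weights W reproduces every label in y."""
    return bool(np.all(tcm_output(X, W, activation) == y))
```

In this picture, the memory capacity is the largest ratio $m/n$ for which weights `W` memorizing a typical random data set still exist; the sketch only checks one candidate `W` and does not search for weights or compute any bound.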

Paper Structure (16 sections, 10 theorems, 113 equations, 4 figures, 1 table)

Introduction
Feed forward neural networks -- mathematical basics
Technical assumptions
Prior work
Contributions
Algebraic description of network data processing
Random Duality Theory (RDT) based capacity analysis
Different ${\bf f}^{(2)}$ activations
Linear hidden layer activations -- ${\bf f}^{(2)}({\bf x})={\bf x}$
Quadratic hidden layer activations -- ${\bf f}^{(2)}({\bf x})={\bf x}^2$
ReLU hidden layer activations -- ${\bf f}^{(2)}({\bf x})=\max({\bf x},0)$
Partially lifted Random Duality Theory (pl RDT)
Specialization to particular ${\bf f}^{(2)}$ activations
Pl RDT capacity estimates for quadratic activations -- ${\bf f}^{(2)}({\bf x})={\bf x}^2$
Pl RDT capacity estimates for ReLU activations -- ${\bf f}^{(2)}({\bf x})=\max({\bf x},0)$
...and 1 more section

Key Result

Lemma 1

(Algebraic optimization representation) Assume a 1-hidden layer TCM with architecture $A([n,d,1];{\bf f}^{(2)})$. Any given data set $\left ({\bf x}^{(0,k)},1\right )_{k=1:m}$ cannot be properly memorized by the network if … where … and $X\triangleq \begin{bmatrix} {\bf x}^{(0,1)} & {\bf x}^{(0,2)} & \dots & {\bf x}^{(0,m)} \end{bmatrix}^T$.

Figures (4)

Figure 1: Memory capacity upper bound as a function of the number of neurons, $d$, in the hidden layer; 1-hidden layer TCM with quadratic activations; plain RDT versus partially lifted RDT (Replica symmetry (RS) and Partial 1rsb $d\rightarrow\infty$ estimates are included as well)
Figure 2: Memory capacity upper bound as a function of the number of neurons, $d$, in the hidden layer; 1-hidden layer TCM with quadratic activations; plain RDT estimate (Replica symmetry (RS) $d\rightarrow\infty$ estimate is included as well)
Figure 3: Memory capacity upper bound as a function of the number of neurons, $d$, in the hidden layer; 1-hidden layer TCM with ReLU activations; plain RDT estimate (Replica symmetry (RS) $d\rightarrow\infty$ estimate is included as well)
Figure 4: Memory capacity upper bound as a function of the number of neurons, $d$, in the hidden layer; 1-hidden layer TCM with quadratic activations; plain RDT versus partially lifted RDT (Replica symmetry (RS) and Partial 1rsb $d\rightarrow\infty$ estimates are included as well)

Theorems & Definitions (20)

Lemma 1
proof
Theorem 1
proof
Lemma 2
proof
Lemma 3
proof
Lemma 4
proof
...and 10 more
