Multi-layer random features and the approximation power of neural networks

Rustem Takhanov

TL;DR

This work develops an approximation theory for multi-layer neural networks in the random-weight, infinite-width regime via the Neural Network Gaussian Process (NNGP) kernel. It proves that the RKHS $\mathcal{H}_{\Sigma^{(L)}}$ associated with the NNGP contains only functions that the corresponding architecture can approximate, and that a multi-layer random features model (ML-RFM) can approximate any $f$ in this RKHS with error decaying as $1/\sqrt{T}$ when the last-layer weights are learned from supervised data. The author contrasts this RKHS-based view with Barron's theorem for 2-layer networks on the sphere, showing that activation functions split into two classes according to the decay rate of the NNGP eigenvalues: when the eigenvalues decay slower than $k^{-n-\frac{2}{3}}$, the ML-RFM guarantee is more succinct than Barron's theorem, while under faster decay Barron's bounds remain competitive. Experiments corroborate the theory and show that realistic networks learn target functions even where neither theorem gives a guarantee, which suggests that optimization and finite-width effects extend beyond NTK-type analyses. These results clarify when deep random feature constructions can efficiently approximate and learn functions in neural architectures.

Abstract

A neural architecture with randomly initialized weights, in the infinite width limit, is equivalent to a Gaussian Random Field whose covariance function is the so-called Neural Network Gaussian Process kernel (NNGP). We prove that a reproducing kernel Hilbert space (RKHS) defined by the NNGP contains only functions that can be approximated by the architecture. To achieve a given approximation error, the required number of neurons in each layer is determined by the RKHS norm of the target function. Moreover, the approximation can be constructed from a supervised dataset by a random multi-layer representation of an input vector, together with training of the last layer's weights. For a 2-layer NN and a domain equal to an $(n-1)$-dimensional sphere in ${\mathbb R}^n$, we compare the number of neurons required by Barron's theorem and by the multi-layer features construction. We show that if the eigenvalues of the integral operator of the NNGP decay slower than $k^{-n-\frac{2}{3}}$, where $k$ is the order of an eigenvalue, then our theorem guarantees a more succinct neural network approximation than Barron's theorem. We also run computational experiments to verify our theoretical findings. Our experiments show that realistic neural networks easily learn target functions even when neither theorem gives any guarantee.
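The infinite-width statement in the abstract can be illustrated numerically for a single hidden layer. The sketch below is not taken from the paper: it assumes a ReLU activation, standard Gaussian weights without biases, and the normalization $\Sigma({\mathbf x},{\mathbf x}') = {\mathbb E}_{w}[\sigma(w^\top {\mathbf x})\sigma(w^\top {\mathbf x}')]$, which may differ from the paper's conventions. The function names are hypothetical. It compares the empirical kernel of a finite-width layer with the known closed form for ReLU (the arc-cosine kernel).

```python
import numpy as np

def relu_nngp_one_layer(x, y):
    """Closed-form E[relu(w.x) relu(w.y)] for w ~ N(0, I) (assumed normalization)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    cos_t = np.clip(x @ y / (nx * ny), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return nx * ny / (2.0 * np.pi) * (np.sin(theta) + (np.pi - theta) * cos_t)

def empirical_kernel(x, y, width, rng):
    """Average of relu(w_j.x) * relu(w_j.y) over `width` random hidden neurons."""
    W = rng.standard_normal((width, x.shape[0]))
    return np.mean(np.maximum(W @ x, 0.0) * np.maximum(W @ y, 0.0))

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0, 0.0])
y = np.array([0.6, 0.8, 0.0])
for width in (100, 10_000, 200_000):
    # The gap to the closed form should shrink as the layer gets wider.
    gap = abs(empirical_kernel(x, y, width, rng) - relu_nngp_one_layer(x, y))
    print(width, gap)
```

The printed gap is a Monte Carlo error, so it fluctuates with the seed; only its overall decrease with width is meaningful.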

Paper Structure (17 sections, 17 theorems, 109 equations, 7 figures)

This paper contains 17 sections, 17 theorems, 109 equations, 7 figures.

Introduction
Preliminaries and notations
Fully connected feed-forward neural network and associated kernels
Main results
A relationship with the Barron space
Experiments
Conclusions
Proof of Theorem \ref{finite-width-kernel}
Proof of Theorem \ref{emp-kernel-concentration}: concentration of $\Sigma^{(h)}_{\rm emp}({\mathbf x},{\mathbf x}')$ around its mean
Proof of Theorem \ref{deviation}: An approximation of $\tilde{\Sigma}^{(h)}({\mathbf x},{\mathbf x}')$ by $\Sigma^{(h)}({\mathbf x},{\mathbf x}')$
Proof of Theorem \ref{main-theorem}
Properties of $\overline{\sigma}$
Proof of Theorem \ref{negative}
Proof sketch of Theorem \ref{positive}
The case of the Gaussian activation function
...and 2 more sections

Key Result

Theorem 1

Let $\mu$ be a probability measure on $\boldsymbol{\Omega}\subseteq {\mathbb R}^n$, $\sigma$ be bounded, and $n_1, \cdots, n_{L}, T\in {\mathbb N}$, $n_L=1$. Then, for any $f\in \mathcal{H}_{\tilde{\Sigma}^{(L)}}$ there exist matrices $W^{(i,h)}\in {\mathbb R}^{n_h\times n_{h-1}}$, where $h=1,\cdots$ [...] where $\tilde{f}({\mathbf x}) = \sum_{i=1}^{T}w_{i}\sigma(W^{(i,L)}\sigma(\cdots \sigma(W^{(i,1)} {\mathbf x})\cdots))$ [...]
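The form of $\tilde{f}$ in the theorem, a weighted sum of $T$ multi-layer networks with scalar outputs, can be sketched in code. The following is an illustration under assumptions that are not stated in the excerpt above: the matrices $W^{(i,h)}$ are drawn as Gaussians scaled by $1/\sqrt{n_{h-1}}$, $\sigma=\tanh$ stands in for a bounded activation, and the weights $w_i$ are fitted by ridge regression on a sample. The function names and the toy target are hypothetical.

```python
import numpy as np

def sample_networks(T, widths, rng):
    """Draw T independent weight stacks; widths = [n_0, n_1, ..., n_L] with n_L = 1."""
    return [[rng.standard_normal((widths[h], widths[h - 1])) / np.sqrt(widths[h - 1])
             for h in range(1, len(widths))]
            for _ in range(T)]

def features(X, networks, sigma=np.tanh):
    """Column i holds sigma(W^(i,L) sigma(... sigma(W^(i,1) x))) for each row x of X."""
    cols = []
    for layers in networks:
        H = X
        for W in layers:
            H = sigma(H @ W.T)
        cols.append(H[:, 0])          # n_L = 1, so each network outputs a scalar
    return np.stack(cols, axis=1)     # shape (num_samples, T)

def fit_last_layer(Phi, y, ridge=1e-6):
    """Ridge-regression estimate of the outer weights w_1, ..., w_T."""
    T = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + ridge * np.eye(T), Phi.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # sample points on the unit sphere
y = np.sin(2.0 * X[:, 0])                       # toy target, chosen only for illustration
nets = sample_networks(T=50, widths=[3, 16, 16, 1], rng=rng)
Phi = features(X, nets)
w = fit_last_layer(Phi, y)
print("train MSE:", np.mean((Phi @ w - y) ** 2))
```

Only the outer weights `w` are trained; the inner matrices stay at their random draws, which is the sense in which the representation is a random multi-layer feature map.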

Figures (7)

Figure 1: An architecture for $n_0=3, n_1=n_2=3, n_3=1, T=3$.
Figure 2: $\log(\hat{\mu}_i)$ versus $\log(i)$ scatter plots for different activation functions with linear regression lines. For relu and erf, eigenvalues of analytically computed NNGP kernels are given for comparison.
Figure 3: Achieved MSE when learning $Y_k$ by random features model as a function of the number of hidden neurons ($n=4$). Pictures for other $n$ can be found in the Appendix.
Figure 4: MSE dynamics during learning $Y_k$ by a 2-NN with the number of hidden neurons 256 and 1024 (rows) and an activation function (columns): (a) $\sigma(x) = e^{-\frac{x^2}{2}}$, (b) $\sigma(x) = \cos(x)$, (c) $\sigma(x) = {\rm ReLU}(x)$.
Figure 5: Achieved MSE when learning $Y_k$ by random features model as a function of the number of hidden neurons ($n=3,6$).
...and 2 more figures
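The kind of log-log fit described in the caption of Figure 2 can be sketched as follows. This is not the paper's experimental code: it assumes the one-hidden-layer ReLU kernel with standard Gaussian weights, estimates eigenvalues from the Gram matrix on uniformly sampled points of the sphere, and fits a line to $\log(\hat{\mu}_i)$ against $\log(i)$. The sample size, the number of eigenvalues kept, and the function names are arbitrary choices made for illustration.

```python
import numpy as np

def relu_kernel_gram(X):
    """Gram matrix of E[relu(w.x) relu(w.x')] for unit-norm rows of X (assumed normalization)."""
    C = np.clip(X @ X.T, -1.0, 1.0)
    theta = np.arccos(C)
    return (np.sin(theta) + (np.pi - theta) * C) / (2.0 * np.pi)

def loglog_slope(eigs):
    """Least-squares slope of log(eigenvalue) against log(index), largest eigenvalue first."""
    eigs = np.sort(eigs)[::-1]
    eigs = eigs[eigs > 0]
    idx = np.arange(1, len(eigs) + 1)
    slope, _intercept = np.polyfit(np.log(idx), np.log(eigs), 1)
    return slope

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # uniform points on the sphere in R^3
eigs = np.linalg.eigvalsh(relu_kernel_gram(X) / len(X))
print("estimated decay exponent:", loglog_slope(eigs[-50:]))  # 50 largest eigenvalues
```

Eigenvalues of a rotation-invariant kernel on the sphere come in degenerate groups, so the fitted slope is only a rough estimate of the decay rate and depends on how many eigenvalues are kept.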

Theorems & Definitions (38)

Theorem 1
Theorem 2
Theorem 3
Theorem 4
Remark 1
Remark 2
Definition 1
Theorem 5
Theorem 6
Theorem 7
...and 28 more
