Approximation Rates for Shallow ReLU$^k$ Neural Networks on Sobolev Spaces via the Radon Transform
Tong Mao, Jonathan W. Siegel, Jinchao Xu
TL;DR
This work studies how well shallow ReLU$^k$ networks of width $n$ approximate functions in Sobolev spaces $W^s(L_p(\Omega))$ on a bounded domain $\Omega \subset \mathbb{R}^d$, with error measured in $L_q(\Omega)$, and obtains rates that are optimal up to logarithmic factors in a broad regime. The authors develop a variation-space framework $\mathcal{K}_1(\mathbb{P}_k^d)$ based on ridge-spline dictionaries and prove the embedding $W^s(L_2(\Omega)) \subset \mathcal{K}_1(\mathbb{P}_k^d)$ at the critical smoothness $s = k + (d+1)/2$, using the Radon transform and the Fourier slice theorem. Combining this embedding with recent nonlinear approximation results for variation spaces and with interpolation techniques, they obtain rates of the form $\|f-f_n\|_{L_p(\Omega)} \le C \|f\|_{W^s(L_p(\Omega))} n^{-s/d}$ for $2\le p\le \infty$ and $0<s\le k+(d+1)/2$, where $f_n$ is a network of width $n$, together with analogous $L_\infty$ bounds stated in terms of a $W^s(L_2)$-norm. A key insight is that adaptivity lets shallow ReLU$^k$ networks exploit Sobolev smoothness up to $s = k+(d+1)/2$ even though they represent piecewise polynomials of fixed degree $k$. This suggests practical benefits for PDE-related tasks and broadens the understanding of nonlinear approximation by ridge-spline networks.
Abstract
Let $\Omega\subset \mathbb{R}^d$ be a bounded domain. We consider the problem of how efficiently shallow neural networks with the ReLU$^k$ activation function can approximate functions from Sobolev spaces $W^s(L_p(\Omega))$ with error measured in the $L_q(\Omega)$-norm. Utilizing the Radon transform and recent results from discrepancy theory, we provide a simple proof of nearly optimal approximation rates in a variety of cases, including when $q\leq p$, $p\geq 2$, and $s \leq k + (d+1)/2$. The rates we derive are optimal up to logarithmic factors, and significantly generalize existing results. An interesting consequence is that the adaptivity of shallow ReLU$^k$ neural networks enables them to obtain optimal approximation rates for smoothness up to order $s = k + (d+1)/2$, even though they represent piecewise polynomials of fixed degree $k$.
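To make the objects in the abstract concrete, the following is a minimal illustrative sketch (not the authors' code) of a shallow ReLU$^k$ network $f_n(x) = \sum_{i=1}^n a_i \max(0, \omega_i \cdot x + b_i)^k$, together with the smoothness threshold $s = k + (d+1)/2$ and the rate exponent $s/d$ quoted above. All function and parameter names are hypothetical, and the sketch assumes $k \geq 1$ to sidestep the convention for $k = 0$.

```python
def relu_k(t, k):
    """ReLU^k activation max(0, t)**k, assuming an integer k >= 1."""
    if k < 1:
        raise ValueError("this sketch assumes k >= 1")
    return max(0.0, t) ** k


def shallow_relu_k(x, weights, biases, coeffs, k):
    """Evaluate f_n(x) = sum_i a_i * relu_k(w_i . x + b_i) at a point x in R^d.

    weights: list of n direction vectors w_i (each of length d)
    biases:  list of n offsets b_i
    coeffs:  list of n outer coefficients a_i
    """
    total = 0.0
    for w, b, a in zip(weights, biases, coeffs):
        pre_activation = sum(wj * xj for wj, xj in zip(w, x)) + b
        total += a * relu_k(pre_activation, k)
    return total


def critical_smoothness(d, k):
    """Largest smoothness s = k + (d + 1) / 2 covered by the stated rates."""
    return k + (d + 1) / 2


def rate_exponent(s, d):
    """Exponent s / d in the error bound n**(-s / d), ignoring log factors."""
    return s / d


# One neuron in d = 2 with k = 2: f(x) = 2 * max(0, x_1)**2.
value = shallow_relu_k([3.0, 5.0], [[1.0, 0.0]], [0.0], [2.0], k=2)
```

For example, with $d = 2$ and $k = 1$ the threshold is $s = 2.5$, so `rate_exponent(critical_smoothness(2, 1), 2)` gives the exponent $1.25$ in $n^{-s/d}$. Each neuron is a polynomial of degree $k$ on either side of the hyperplane $\omega_i \cdot x + b_i = 0$, which is why the network is a piecewise polynomial of fixed degree $k$. The adaptivity mentioned in the abstract corresponds to choosing `weights` and `biases` depending on the target function.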
