Smoothness Adaptivity in Constant-Depth Neural Networks: Optimal Rates via Smooth Activations

Yuhao Liu, Zilin Wang, Lei Wu, Shaobo Zhang

TL;DR

It is proved that constant-depth networks equipped with smooth activations automatically exploit arbitrarily high orders of target function smoothness, achieving the minimax-optimal approximation and estimation error rates (up to logarithmic factors).

Abstract

Smooth activation functions are ubiquitous in modern deep learning, yet their theoretical advantages over non-smooth counterparts remain poorly understood. In this work, we characterize both approximation and statistical properties of neural networks with smooth activations over the Sobolev space $W^{s,\infty}([0,1]^d)$ for arbitrary smoothness $s>0$. We prove that constant-depth networks equipped with smooth activations automatically exploit arbitrarily high orders of target function smoothness, achieving the minimax-optimal approximation and estimation error rates (up to logarithmic factors). In sharp contrast, networks with non-smooth activations, such as ReLU, lack this adaptivity: their attainable approximation order is strictly limited by depth, and capturing higher-order smoothness requires proportional depth growth. These results identify activation smoothness as a fundamental mechanism, alternative to depth, for attaining statistical optimality. Technically, our results are established via a constructive approximation framework that produces explicit neural network approximators with carefully controlled parameter norms and model size. This complexity control ensures statistical learnability under empirical risk minimization (ERM) and removes the impractical sparsity constraints commonly required in prior analyses.
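
As a point of reference for the estimation rate mentioned above, the classical minimax benchmark for nonparametric regression over a smoothness-$s$ ball in dimension $d$ is recalled below. This is the standard textbook rate, not the paper's exact theorem statement; the sample size $n$ and the estimator $\hat f$ are notation assumed here, and logarithmic factors are omitted.

$$
\inf_{\hat f}\ \sup_{\|f^\star\|_{W^{s,\infty}([0,1]^d)} \leqslant 1}\ \mathbb{E}\,\|\hat f - f^\star\|_{L^2([0,1]^d)}^2 \;\asymp\; n^{-\frac{2s}{2s+d}}
$$

For example, with $d=4$ the exponent $\frac{2s}{2s+d}$ equals $\frac{1}{2}$ when $s=2$ and $\frac{3}{4}$ when $s=6$, so an estimator that adapts to higher smoothness attains a visibly faster rate.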

Paper Structure (45 sections, 33 theorems, 271 equations, 4 figures, 1 table)


Key Result

Theorem 3.1

Let $\phi$ satisfy Assumptions ass:smooth--ass:piece. For any $s>0$ and any $f^\star \in W^{s,\infty}([0,1]^d)$ with $\|f^\star\|_{W^{s,\infty}([0,1]^d)} \leqslant 1$, and for any $\epsilon \in (0,1)$, there exists a constant-depth neural network with such that

Figures (4)

  • Figure 1: Generalization error versus sample size for two-layer networks trained with different activation functions. Markers denote the measured generalization errors at each sample size (averaged over 5 runs), and solid lines show least-squares fits of the form $E(n)\propto n^{-\alpha}$. The fitted exponents $\alpha$, reported in the legend, indicate a faster decay of the generalization error for smooth activations as the sample size increases.
  • Figure 2: Illustration of the approximator construction for $f^\star$ in Theorem \ref{thm:app_infinite_local} with $d=1$ and $K = 2$. (a) Approximate $f^\star$ by piecewise polynomials, realized as the product of global polynomials and piecewise constant functions. (b) The $4$-piece piecewise constant function on refined cells is decomposed into a summation of two $2$-piece functions defined over coarse cells, multiplied by refined-cell indicator functions. (c) The refined-cell indicator functions are realized by taking the extracted relative position information $x - a(x)$ as the input for the reference indicators $\mathbbm{1}_{[0, 0.25]}$ and $\mathbbm{1}_{[0.25, 0.5]}$, which correspond to the refined cells contained within the leftmost coarse cell $[0, 0.5]$.
  • Figure 4: Illustration of the constructive approximation for the weight function $w_1$, instantiated with $K=2$. Dependencies on indices $\phi, \beta$ and $\delta$ are suppressed for clarity. Panels (a) and (b) visualize the approximators constructed in Lemma \ref{lem:app_wieght_band} and Lemma \ref{lem:app_univariate_weight_Linfty}, respectively. The orange shaded region denotes the domain $\Omega_{\mathrm{coarse,band}}^{K,1}(\delta)$, while the blue region corresponds to the difference set $\Omega_{\mathrm{coarse,band}}^{K,1}(2\delta) \setminus \Omega_{\mathrm{coarse,band}}^{K,1}(\delta)$.
  • Figure 5: Illustration of the $L^{\infty}([0,1])$ approximation strategy for $f^\star$ detailed in Theorem \ref{thm:app_Linfty}. Large approximation errors of $f_i$ on the bands $\Omega_{\mathrm{band}}^{2,i}(\delta)$ are nullified by the vanishing weight functions $w_i(x)$. Since the weights constitute a partition of unity, the global reconstruction $w_1f_1 + w_2f_2$ maintains the desired approximation accuracy across the entire domain $[0,1]$.
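
The Figure 1 caption describes least-squares fits of the form $E(n)\propto n^{-\alpha}$. A minimal sketch of one common way to obtain such an exponent is given below, assuming an ordinary least-squares line fit in log-log coordinates; the paper's actual fitting procedure may differ. The function name `fit_power_law` and the synthetic data are hypothetical and serve only as illustration.

```python
import numpy as np

def fit_power_law(n, err):
    """Fit err ~ C * n**(-alpha) by least squares on (log n, log err).

    Returns (alpha, C). Assumes all entries of n and err are positive.
    """
    log_n = np.log(np.asarray(n, dtype=float))
    log_e = np.log(np.asarray(err, dtype=float))
    # A power law is a straight line in log-log space:
    # log err = log C - alpha * log n
    slope, intercept = np.polyfit(log_n, log_e, deg=1)
    return -slope, float(np.exp(intercept))

# Hypothetical, noise-free data generated with alpha = 0.5 and C = 3
# (not taken from the paper's experiments).
sample_sizes = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])
errors = 3.0 * sample_sizes ** -0.5

alpha, C = fit_power_law(sample_sizes, errors)
print(f"fitted alpha = {alpha:.3f}, C = {C:.3f}")
```

With measured errors in place of the synthetic array, a larger fitted `alpha` corresponds to a faster decay of the generalization error with sample size, which is the comparison the figure legend reports across activation functions.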

Theorems & Definitions (73)

  • Theorem 3.1: $L^2$ approximation
  • Remark 3.2: Width--norm trade-off
  • Theorem 3.3: $L^\infty$ approximation
  • Theorem 4.1
  • Proposition 5.1: Approximation lower bound for constant-depth ReLU networks
  • Remark 6.1
  • Lemma B.1: Bramble-Hilbert lemma
  • proof
  • Lemma B.2
  • proof
  • ...and 63 more