How DNNs break the Curse of Dimensionality: Compositionality and Symmetry Learning

Arthur Jacot, Seok Hoan Choi, Yuxiao Wen

TL;DR

This work addresses how deep neural networks can overcome the curse of dimensionality by exploiting compositional structure and symmetry learning. It introduces Accordion Networks (AccNets) and derives generalization bounds that hinge on an $F_{1}$-norm based complexity measure coupled with layer Lipschitz constants, enabling width-independent control of the generalization gap. By connecting AccNets to compositions of Sobolev balls, the authors show that deep compositional representations can learn high-dimensional functions with near-optimal rates, particularly when inner maps reduce effective dimensionality via symmetries. Empirical results on synthetic data and real tasks illustrate phase transitions in learning difficulty and validate the predicted scaling, suggesting practical advantages for symmetry-aware deep architectures in high-dimensional settings.

Abstract

We show that deep neural networks (DNNs) can efficiently learn any composition of functions with bounded $F_{1}$-norm, which allows DNNs to break the curse of dimensionality in ways that shallow networks cannot. More specifically, we derive a generalization bound that combines a covering number argument for compositionality with the $F_{1}$-norm (or the related Barron norm) for large width adaptivity. We show that the global minimizer of the regularized loss of DNNs can fit, for example, the composition of two functions $f^{*}=h\circ g$ from a small number of observations, assuming $g$ is smooth/regular and reduces the dimensionality (e.g. $g$ could be the quotient map of the symmetries of $f^{*}$), so that $h$ can be learned in spite of its low regularity. The measure of regularity we consider is the Sobolev norm with different levels of differentiability, which is well adapted to the $F_{1}$-norm. We compute scaling laws empirically and observe phase transitions depending on whether $g$ or $h$ is harder to learn, as predicted by our theory.

Paper Structure (27 sections, 20 theorems, 113 equations, 4 figures)

Key Result

Theorem 1

For any input distribution $\pi$ supported on the $L_2$ ball $B(0,b)$ with radius $b$, we have with probability $1-p$, over the training samples $x_{1},\dots,x_{N}$, that for all $f\in \{f: \| f\|_{F_1}\leq R , \| f\|_{\infty} \leq B \}$ Therefore if $\tilde{\mathcal{L}}_{N}(f)= O(BR N^{-\frac{1}{2}})$ then $\mathcal{L}(f)= O(BR N^{-\frac{1}{2}})$.

Figures (4)

  • Figure 1: Visualization of scaling laws. We observe that deep networks (either AccNets or DNNs) achieve better scaling laws than kernel methods or shallow networks on certain compositional tasks, in agreement with our theory. We also see that our new generalization bounds approximately recover the right scaling laws (even though they are orders of magnitude too large overall). We consider a compositional true function $f^*=h\circ g$ where $g$ maps from dimension 15 to 3 while $h$ maps from 3 to 20, and we write $\nu_g,\nu_h$ for the number of times $g,h$ are differentiable. In the first plot $\nu_g=8,\nu_h=2$, so that $g$ is easy to learn while $h$ is hard, whereas in the second plot $\nu_g=9,\nu_h=10$, so both $g$ and $h$ are relatively easy. The third plot presents the test error and generalization bounds for MNIST and WESAD (doi:10.1145/3242969.3242985).
  • Figure 2: A comparison of empirical and theoretical error rates. The first two columns show the log decay rate of the test error with respect to the dataset size $N$ based on our empirical simulations for 4 different models. The top right plot depicts the theoretical decay rate of the test error $-\min\{\frac{1}{2},\frac{2\nu_g}{2\nu_g + d_{in}},\frac{2\nu_h}{2\nu_h + d_{mid}}\}$. The bottom right plot displays the difference between the rates of AccNets and shallow nets. The lower left region represents the area where $g$ is easier to learn than $h$, the upper right where $h$ is easier to learn than $g$, and the lower right region where both $g$ and $h$ are easy. We see that the biggest gain of AccNets over shallow nets is in the lower left region, where learning $h$ is hard.
  • Figure 3: A comparison of the singular values of the weight matrices for DNN and AccNet models. The first two plots represent cases where $N=10000$, while the right two plots correspond to $N=200$. The number of outliers at the top of each plot signifies the rank of each network. The plots with $N=10000$ capture the true rank more clearly than those with $N=200$, indicating that a larger dataset provides more accurate rank determination.
  • Figure 4: Comparison of our bounds to a number of previous bounds on the composition-of-two-functions dataset across different differentiabilities of $g$ and $h$. For simplicity, we drop the constant prefactors in each bound, since they are mostly an artifact of the proofs. We see that our bound (Jacot24 Jacobian) is strictly better than previous ones, even when upper bounding the Lipschitz constants by the operator norms (Jacot24 OP). Since we cannot compute the Lipschitz constants, we approximate them by a max over 100 random points (taking more points does not significantly change the final bound). The lines appear flat because most of the variation is between different bounds rather than across $N$, but by zooming in, we can see a downward trend, albeit a slow and noisy one. The case $(2,2)$ lies in the regime where both $g$ and $h$ are hard, in which case our theory predicts that the operator norm bound should have a worse scaling exponent than the Lipschitz-based bound. The experiments appear to match our prediction, because the Lipschitz-constant-based bound seems to be the only one that is decreasing in $N$, while the operator norm bound, as well as all previous bounds, are increasing (the fact that uniform generalization bounds can be increasing in $N$ has been observed in previous work [nagarajan_2019_uniform_bounds_unable]). Though these trends are interesting, there is clearly a lot of noise in the resulting curves, so it is difficult to confidently conclude anything from these experiments.
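
The theoretical decay rate quoted in the Figure 2 caption can be evaluated directly. The following is a minimal sketch, not the authors' code: the function names `predicted_log_decay_rate` and `bottleneck` are hypothetical, and reusing the dimensions from Figure 1 ($d_{in}=15$, $d_{mid}=3$) as inputs is an assumption made purely for illustration. Exact fractions are used so the three candidate rates can be compared without rounding.

```python
from fractions import Fraction


def predicted_log_decay_rate(nu_g, nu_h, d_in, d_mid):
    """Return -min{1/2, 2*nu_g/(2*nu_g + d_in), 2*nu_h/(2*nu_h + d_mid)}.

    This is the formula stated in the Figure 2 caption. The argument names
    are illustrative choices, not identifiers taken from the paper.
    """
    half = Fraction(1, 2)
    rate_g = Fraction(2 * nu_g, 2 * nu_g + d_in)    # term governed by g
    rate_h = Fraction(2 * nu_h, 2 * nu_h + d_mid)   # term governed by h
    return -min(half, rate_g, rate_h)


def bottleneck(nu_g, nu_h, d_in, d_mid):
    """Name the term attaining the minimum (ties resolved in the listed order)."""
    terms = {
        "1/2": Fraction(1, 2),
        "g": Fraction(2 * nu_g, 2 * nu_g + d_in),
        "h": Fraction(2 * nu_h, 2 * nu_h + d_mid),
    }
    return min(terms, key=terms.get)


if __name__ == "__main__":
    # Assumed dimensions (borrowed from Figure 1 for illustration only).
    d_in, d_mid = 15, 3
    # (8, 2) is the first setting of Figure 1; (1, 1) and (8, 1) are hypothetical.
    for nu_g, nu_h in [(8, 2), (1, 1), (8, 1)]:
        rate = predicted_log_decay_rate(nu_g, nu_h, d_in, d_mid)
        print(nu_g, nu_h, rate, bottleneck(nu_g, nu_h, d_in, d_mid))
```

Under these assumed inputs, the term that attains the minimum indicates which of the three regimes in the caption applies: the $\frac{1}{2}$ cap, the term controlled by $g$, or the term controlled by $h$.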

Theorems & Definitions (38)

  • Remark
  • Theorem 1
  • Theorem 2
  • Remark
  • Proposition 3
  • Theorem 4
  • Remark
  • Theorem 5
  • Theorem 6
  • proof
  • ...and 28 more