On the Sample Complexity of One Hidden Layer Networks with Equivariance, Locality and Weight Sharing

Arash Behboodi, Gabriele Cesa

TL;DR

This paper analyzes how equivariance, locality, and weight sharing influence the sample complexity of one-hidden-layer networks through Rademacher complexity-based bounds. It derives dimension-free bounds for group-convolution and equivariant architectures, extends to max-pooling and multi-layer networks with mild dimension dependence, and provides a matching lower bound for the Rademacher complexity. The authors also connect the analysis to general equivariant networks on compact groups, weight-sharing schemes, and locally constrained filters, highlighting a trade-off between locality and expressivity via an uncertainty-principle argument. Empirical results on rotated MNIST and CIFAR-10 validate the theoretical bound's relevance and reveal consistent trends with respect to group size, pooling, and frequency-domain locality. Overall, the work clarifies when and how architectural biases like symmetry, locality, and weight sharing can improve generalization in neural networks, offering dimension-free insights and practical guidance for design choices in symmetry-aware models.

Abstract

Weight sharing, equivariance, and local filters, as in convolutional neural networks, are believed to contribute to the sample efficiency of neural networks. However, it is not clear how each of these design choices contributes to the generalization error. Through the lens of statistical learning theory, we aim to provide insight into this question by characterizing the relative impact of each choice on the sample complexity. We obtain lower and upper sample complexity bounds for a class of single hidden layer networks. For a large class of activation functions, the bounds depend only on the norm of the filters and are dimension-independent. We also provide bounds for max-pooling and an extension to multi-layer networks, both with mild dimension dependence. We provide a few takeaways from the theoretical results. Depending on the weight-sharing mechanism, non-equivariant weight sharing can yield a generalization bound similar to that of equivariant weight sharing. We show that locality has generalization benefits; however, the uncertainty principle implies a trade-off between locality and expressivity. We conduct extensive experiments and highlight some consistent trends for these models.
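To make the dimension-free claim concrete, the following minimal sketch evaluates the quantity $\frac{M_1M_2}{\sqrt{m}}$ quoted in the Figure 2 caption for a few training set sizes. It assumes that $M_1$ and $M_2$ are norm bounds on the weights of the two layers; this excerpt does not define them, and the function name and the numeric values are purely illustrative.

```python
import math


def norm_based_bound(m1: float, m2: float, m: int) -> float:
    """Evaluate M1 * M2 / sqrt(m).

    m1, m2: assumed norm bounds on the two layers' weights (hypothetical values).
    m: training set size.
    Note that the input dimension does not appear anywhere in the expression.
    """
    if m <= 0:
        raise ValueError("training set size m must be positive")
    return m1 * m2 / math.sqrt(m)


# Illustrative values only; they are not taken from the paper's experiments.
sizes = [100, 400, 1600, 6400]
bounds = [norm_based_bound(2.0, 3.0, m) for m in sizes]

for m, b in zip(sizes, bounds):
    print(f"m={m:5d}  bound={b:.4f}")
```

Quadrupling $m$ halves the value, which is the $\frac{1}{\sqrt{m}}$ trend the Figure 2 caption refers to.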

Paper Structure

This paper contains 65 sections, 23 theorems, 171 equations, 6 figures, 2 tables.

Key Result

Theorem 4.1

Consider the hypothesis space ${\mathcal{H}}$ defined in eq. def:gcnn_hyp_space. If $P(\cdot)$ is the pooling operation represented as $P\circ {\bm{z}} = \phi(\frac{1}{|G|} \mathbf{1}^\top \rho({\bm{z}}))$, where the two functions $\rho(\cdot)$, $\phi(\cdot)$ and the activation function $\sigma(\cdo

Figures (6)

  • Figure 1: Visualization of the network architectures with equivariance, locality, and weight sharing. On the right, we also summarize how each choice impacts the generalization error in our theory.
  • Figure 2: Numerical results for the generalization error on the rotated MNIST and CIFAR10 datasets. The plots on the left, (a)-(c), confirm that our theoretical bound captures the effect of different configurations - equivariance groups $G$, training set sizes $m$ and datasets - on the generalization error, i.e. there is a positive correlation between the generalization error and our theoretical bound $\frac{M_1M_2}{\sqrt{m}}$ across all these cases. On the right, (b)-(d), we verify the bound decreases following a trend similar to $\frac{1}{\sqrt{m}}$ and approaching zero for large training set sizes $m$.
  • Figure 3: Our bound vs. the generalization error on CIFAR10 (training set size $m=3200$) when varying the maximum frequency used to parameterize the filters, which is the locality effect in the frequency domain. Each dot and its error bars represent the mean and standard deviation over at least $3$ runs with the same configuration. As expected, architectures leveraging a lower frequency design achieve lower generalization error and our bound can exactly capture this effect. Note that increased frequencies are still beneficial for the final test performance: we study the trade-off between generalization and test performance as a function of the model frequency in Fig. 4.
  • Figure 4: Test accuracy vs. generalization error on CIFAR10 when varying the maximum frequency used to parameterize the filters and the training set size $m$. Each dot is the average performance over at least $3$ runs with the same configuration. In Fig. 3, we found that higher frequencies correlated with higher generalization error: here, we study the trade-off between generalization and test performance. For each dataset size $m$, increasing the frequency improves the test performance until a certain saturation point; beyond that, increased frequencies mostly lead to increased generalization error. We highlight this effect by drawing four dotted lines following the trends for varying frequencies with $G=C_{32}$ on four different dataset sizes $m$: the different slopes of the curves correspond to the different saturation effects.
  • Figure 5: Graphical representation of the equivariant linear projection used to preprocess the image data. An example image is projected with a single filter rotated $8$ times and mirrored and rotated $8$ more times. The output is a $8\cdot 2 =16$ dimensional vector representing a signal over $G=D_8$. A rotation or mirroring of the input image results in a periodic shift or permutation of the output channels.
  • ...and 1 more figure
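The projection described in the Figure 5 caption can be illustrated with a small sketch. As a simplifying assumption, it uses the cyclic group $C_4$ (rotations by multiples of 90 degrees, no mirroring) instead of $D_8$, so that rotations are exact on the pixel grid. The function `project_c4` and the random image and filter are hypothetical and are not the authors' preprocessing code.

```python
import numpy as np


def project_c4(image: np.ndarray, filt: np.ndarray) -> np.ndarray:
    """Project an image onto the 4 rotated copies of one filter.

    Returns a length-4 vector, i.e. a signal over the group C_4:
    entry k is the inner product of the image with the filter rotated k times.
    """
    return np.array([np.sum(image * np.rot90(filt, k)) for k in range(4)])


rng = np.random.default_rng(0)
image = rng.standard_normal((5, 5))  # stand-in for an input image
filt = rng.standard_normal((5, 5))   # stand-in for a single filter

features = project_c4(image, filt)
rotated_features = project_c4(np.rot90(image), filt)

# Rotating the input by 90 degrees cyclically shifts the output channels.
assert np.allclose(rotated_features, np.roll(features, 1))

# Averaging over the group removes the shift, giving a rotation-invariant value.
assert np.isclose(features.mean(), rotated_features.mean())
```

The shift follows because rotating both the image and the filter leaves their inner product unchanged, so entry $k$ of the rotated input's features equals entry $k-1$ of the original features. The final check mirrors the averaging term $\frac{1}{|G|}\mathbf{1}^\top$ that appears in the pooling operation of Theorem 4.1, here with identity $\rho$ and $\phi$ as an assumption.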

Theorems & Definitions (39)

  • Theorem 4.1
  • Remark 4.2: Frequency Domain Analysis
  • Theorem 4.3
  • Theorem 4.4
  • Theorem 5.1
  • Proposition 6.1
  • Proposition 7.1
  • Proposition 7.2
  • Definition A.1: Group
  • Definition A.2: Cyclic Group
  • ...and 29 more