On the hardness of learning under symmetries

Bobak T. Kiani; Thien Le; Hannah Lawrence; Stefanie Jegelka; Melanie Weber

TL;DR

The paper tackles the computational hardness of learning equivariant neural networks under gradient-based optimization. By extending the correlational statistical query (CSQ) framework to invariant architectures (notably GNNs and frame-averaged CNNs) and analyzing Gaussian input distributions, it derives exponential and superpolynomial lower bounds that persist despite symmetry. It also proves NP-hardness for proper learning of GNNs and provides experiments that corroborate the hardness results. The findings suggest that symmetry alone is insufficient for efficient learnability in worst-case settings, underscoring the need for additional inductive biases or problem structure to achieve practical guarantees.

Abstract

We study the problem of learning equivariant neural networks via gradient descent. The incorporation of known symmetries ("equivariance") into neural nets has empirically improved the performance of learning pipelines, in domains ranging from biology to computer vision. However, a rich yet separate line of learning theoretic research has demonstrated that actually learning shallow, fully-connected (i.e. non-symmetric) networks has exponential complexity in the correlational statistical query (CSQ) model, a framework encompassing gradient descent. In this work, we ask: are known problem symmetries sufficient to alleviate the fundamental hardness of learning neural nets with gradient descent? We answer this question in the negative. In particular, we give lower bounds for shallow graph neural networks, convolutional networks, invariant polynomials, and frame-averaged networks for permutation subgroups, which all scale either superpolynomially or exponentially in the relevant input dimension. Therefore, in spite of the significant inductive bias imparted via symmetry, actually learning the complete classes of functions represented by equivariant neural networks via gradient descent remains hard.
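The claim that the CSQ model encompasses gradient descent can be illustrated with a standard sketch. It assumes squared loss, a parametric model $f_\theta$, inputs $x$ drawn from a known distribution $\mathcal{D}$, and labels given by a target function $c(x)$. These symbols are illustrative assumptions and are not taken from the paper's notation.

```latex
\nabla_\theta \, \mathbb{E}_{x \sim \mathcal{D}}\!\left[ \tfrac{1}{2}\bigl(f_\theta(x) - c(x)\bigr)^2 \right]
  = \underbrace{\mathbb{E}_{x \sim \mathcal{D}}\!\left[ f_\theta(x)\, \nabla_\theta f_\theta(x) \right]}_{\text{independent of the target } c}
  \;-\; \underbrace{\mathbb{E}_{x \sim \mathcal{D}}\!\left[ c(x)\, \nabla_\theta f_\theta(x) \right]}_{\text{correlation of } c \text{ with a query function}}
```

The first term does not involve the labels, so it can be evaluated from the model and the input distribution alone. Each coordinate of the second term has the form $\mathbb{E}[c(x)\, g(x)]$ with $g = \partial f_\theta / \partial \theta_j$, which is the kind of quantity a correlational query returns up to a tolerance. Under these assumptions, a population gradient step can therefore be simulated with one correlational query per parameter.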

Paper Structure

This paper contains 46 sections, 40 theorems, 127 equations, 4 figures, and 1 table.

Introduction
Our Contributions
Related work
Background and Notation
SQ Learning framework
Warm up: invariant Boolean functions
Lower bounds for GNNs
Hardness in number of nodes
$2$ hidden layer GNN family
Family of hard functions
Hardness in feature dimension
$1$-hidden layer GNN
Family of hard functions
NP hardness of proper learning of GNNs
CSQ Lower bound for CNNs and Frame Averaging
...and 31 more sections

Key Result

Theorem 2

For a given symmetry group $G$ with representation $\rho:G \to GL(\{-1,+1\}^n)$, let $\|p_{\mathcal{O}_\rho}\| \coloneqq \left(\sum_{O_k \in \mathcal{O}_\rho} \left(\frac{|O_k|}{2^n}\right)^2\right)^{1/2}$ and let $\mathcal{H}_\rho$ be the class of symmetric Boolean functions, defined as ... Any SQ learner capable of learning $\mathcal{H}_\rho$ up to sufficiently small classification error probability $\epsilon$ ...
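As a small worked example, the quantity $\|p_{\mathcal{O}_\rho}\|$ from Theorem 2 can be computed by brute force for tiny $n$. The sketch below assumes that the group acts on $\{-1,+1\}^n$ by permuting coordinates and that it is specified by a list of generating permutations. The function name `orbit_norm` and the example generators are hypothetical and chosen only for illustration.

```python
import itertools
import math


def orbit_norm(n, generators):
    """Brute-force sqrt(sum_k (|O_k| / 2^n)^2) over the orbits O_k of a
    permutation group acting on {-1,+1}^n by permuting coordinates.

    `generators` is a list of permutations, each a tuple `perm` of length n;
    applying it maps x to (x[perm[0]], ..., x[perm[n-1]]).
    Returns the norm together with the list of orbit sizes.
    """
    seen = set()
    orbit_sizes = []
    for x in itertools.product((-1, 1), repeat=n):
        if x in seen:
            continue
        # Flood-fill the orbit of x by repeatedly applying the generators.
        # For a finite group this reaches the full orbit, since inverses
        # are positive powers of the generators.
        orbit = {x}
        stack = [x]
        while stack:
            y = stack.pop()
            for perm in generators:
                z = tuple(y[perm[i]] for i in range(n))
                if z not in orbit:
                    orbit.add(z)
                    stack.append(z)
        seen |= orbit
        orbit_sizes.append(len(orbit))
    norm = math.sqrt(sum((s / 2 ** n) ** 2 for s in orbit_sizes))
    return norm, orbit_sizes


n = 4
identity = tuple(range(n))                # trivial group
shift = tuple((i + 1) % n for i in range(n))  # generates the cyclic group
swap = (1, 0) + tuple(range(2, n))        # with `shift`, generates S_n

trivial_norm, trivial_orbits = orbit_norm(n, [identity])
cyclic_norm, cyclic_orbits = orbit_norm(n, [shift])
sym_norm, sym_orbits = orbit_norm(n, [shift, swap])
```

For $n = 4$, the trivial group has 16 singleton orbits and norm $2^{-n/2} = 0.25$. The full symmetric group has one orbit per Hamming weight, with sizes $\binom{4}{k}$, so the squared norm equals $\binom{8}{4}/4^4 = 70/256$. Larger groups merge points into bigger orbits, which increases the norm.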

Figures (4)

Figure 1: An overparameterized GNN (a) and CNN (b) fail to learn functions from the classes $\mathcal{H}_{ER,n}$ and $C^{\mathcal{B}}_{\mathcal{F}}$, respectively, either by failing to fit the training set or by overfitting the data. Plots are aggregated and averaged over five random realizations.
Figure 2: Sample form of function $h(x)$ used in constructing $g_{S,b}({\bm{A}})$ as a GNN. For the construction, we will have $x = \sum_{i \in S} [{\bm{c}}_{{\bm{A}}}]_i$.
Figure 3: Replication of the experiments in Figure 1, except that here we consider a minimal architecture consisting of a single layer of graph or cyclic convolution followed by a single-hidden-layer MLP. This is the minimal number of layers needed to learn the desired function classes for the architectures considered. For the CNN plot, the jumps in the train set MSE are due to perturbations in the loss at very low values near machine precision.
Figure 4: Replication of the GNN experiments in Figure 1 with different optimizers shows that the performance of the GNN is virtually the same across optimizers. Performance is averaged over 10 runs. For each run, the learning rate is chosen by perturbing the default learning rate by a random multiplicative factor in the range $[0.1, 10]$.

Theorems & Definitions (86)

Example 1: GD from $\operatorname{CSQ}$
Definition 1: SQ (CSQ) Learning
Theorem 2: Boolean SQ hardness
Proof: Proof sketch
Theorem 3: SQ hardness of $\mathcal{H}_{ER,n}$
Proof: Proof sketch
Theorem 4: Exponential CSQ lower bound for GNNs
Proof: Proof sketch
Proposition 5: $\mathsf{NP}$-hardness of GNN training (informal)
Example 2: Frame for CNN
...and 76 more
