Equivariant Neural Tangent Kernels

Philipp Misof, Pan Kessel, Jan E. Gerken

TL;DR

This work develops layer-wise recursion relations for the Neural Tangent Kernel and Neural Network Gaussian Process kernels of group convolutional neural networks (GCNNs), enabling analytic study of training dynamics for equivariant architectures. It proves that in the infinite-width limit, training with full data augmentation on a non-equivariant network yields the same mean predictions as a manifestly equivariant GCNN trained without augmentation, and extends these results to data off the manifold. The authors specialize the theory to roto-translations in the plane ($G=C_{n}\ltimes\mathbb{R}^{2}$) and to 3D rotations ($G=\mathrm{SO}(3)$), deriving efficient kernel recursions and implementing them in the neural-tangents framework. Empirical results on histological image classification and molecular property prediction show that equivariant NTKs outperform their non-equivariant counterparts, and finite-width ensembles show the predicted equivalences approximately hold, validating the practical relevance of the theory.

Abstract

Little is known about the training dynamics of equivariant neural networks, in particular how they compare to data-augmented training of their non-equivariant counterparts. Recently, neural tangent kernels (NTKs) have emerged as a powerful tool to analytically study the training dynamics of wide neural networks. In this work, we take an important step towards a theoretical understanding of the training dynamics of equivariant models by deriving neural tangent kernels for a broad class of equivariant architectures based on group convolutions. As a demonstration of the capabilities of our framework, we show an interesting relationship between data augmentation and group convolutional networks. Specifically, we prove that the two share the same expected prediction at all training times, even off-manifold. In this sense, they have the same training dynamics. We demonstrate in numerical experiments that this still holds approximately for finite-width ensembles. By implementing equivariant NTKs for roto-translations in the plane ($G=C_{n}\ltimes\mathbb{R}^{2}$) and 3D rotations ($G=\mathrm{SO}(3)$), we show that equivariant NTKs outperform their non-equivariant counterparts as kernel predictors for histological image classification and quantum mechanical property prediction.
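
The link between data augmentation and symmetry can be illustrated in a much simpler setting than the one treated in the paper. The sketch below is not the equivariant NTK derived by the authors: it swaps in a plain RBF kernel, which is unchanged when both inputs are rotated together, and runs kernel ridge regression on a fully $C_4$-augmented training set of random images. Under these assumptions the resulting predictor should be invariant under 90-degree rotations of the test input, up to floating-point error. All function names, the random data, the length scale and the ridge term are hypothetical choices made for illustration.

```python
import math
import random


def rot90(img):
    """Rotate a square image (tuple of row tuples) by 90 degrees."""
    n = len(img)
    return tuple(tuple(img[j][n - 1 - i] for j in range(n)) for i in range(n))


def c4_orbit(img):
    """Return the four C4 rotations of img, starting with img itself."""
    orbit = [img]
    for _ in range(3):
        orbit.append(rot90(orbit[-1]))
    return orbit


def rbf(a, b, length_scale=2.0):
    """RBF kernel on images (stand-in kernel, assumed here instead of an NTK)."""
    sq = sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))
    return math.exp(-sq / (2.0 * length_scale ** 2))


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_augmented(images, labels, ridge=1e-6):
    """Kernel ridge regression on the fully C4-augmented training set."""
    xs = [g_img for img in images for g_img in c4_orbit(img)]
    ys = [y for y in labels for _ in range(4)]  # labels are shared along each orbit
    gram = [[rbf(a, b) + (ridge if i == j else 0.0) for j, b in enumerate(xs)]
            for i, a in enumerate(xs)]
    alpha = solve(gram, ys)
    return lambda x: sum(w * rbf(x, xi) for w, xi in zip(alpha, xs))


rng = random.Random(0)


def random_image(n=3):
    """Draw an n x n image with independent standard normal pixels."""
    return tuple(tuple(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(n))


train_images = [random_image() for _ in range(5)]
train_labels = [rng.gauss(0.0, 1.0) for _ in range(5)]
predict = fit_augmented(train_images, train_labels)

# Evaluate the predictor on all four rotations of an unseen test image.
test_image = random_image()
predictions = [predict(g_img) for g_img in c4_orbit(test_image)]
max_spread = max(predictions) - min(predictions)
```

Here `max_spread` measures how much the prediction varies across the four rotations of a test image that was not part of the training set; it should sit at the level of rounding error. This only shows invariance of an augmented kernel predictor, which is a weaker statement than the equivalence with group convolutional networks proven in the paper.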

Paper Structure

This paper contains 36 sections, 18 theorems, 126 equations, 7 figures, 3 tables.

Key Result

Theorem 1

The layer-wise recursive relations for the NNGP and NTK of the group convolutional layer (Eq. 9) are given by

Figures (7)

  • Figure 1: Convergence of the Monte-Carlo estimates of the NTK to their infinite-width limits for $G=C_{4}\ltimes\mathbb{R}^{2}$. Plotted is the relative error averaged over the components of a $3\times3$ Gram matrix for networks with a ReLU or an error function nonlinearity. Bands show $\pm$ one standard deviation of the estimator.
  • Figure 2: Convergence of finite-width ensembles trained with data augmentation to ensembles of GCNNs on MNIST. $L^2$-distance between the logits of the equivariant ensemble and the non-equivariant ensemble trained with data augmentation, for different ensemble sizes on out-of-distribution data. The distance decreases for larger ensembles.
  • Figure 3: NTK for image classification. Test accuracy of the NTK kernel predictors in the infinite-width, infinite-training-time limit for different training set sizes. Results are shown for both a conventional CNN and a $C_4 \ltimes \mathbb{R}^2$-invariant GCNN.
  • Figure 4: NTK for molecular energy prediction. Mean absolute errors (MAEs) of the molecular energies predicted by the NTK kernel predictors in the infinite-width, infinite-training-time limit for different training set sizes. Results are shown for both a conventional MLP and an $\mathrm{SO}(3)$-invariant GCNN.
  • Figure 5: Convergence of the Monte-Carlo estimates of the NNGP to their infinite-width limits for $G=C_{4}\ltimes\mathbb{R}^{2}$. Plotted is the relative error averaged over the components of a $3\times3$ Gram matrix for networks with a ReLU or an error function nonlinearity. The bands correspond to $\pm$ one standard deviation of the estimator.
  • ...and 2 more figures
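
The kind of convergence check described in the captions of Figures 1 and 5 can be sketched in a far simpler setting. The example below does not use the group convolutional kernels of the paper: it assumes a single fully connected ReLU layer with standard normal weights, for which the infinite-width NNGP is known in closed form (the degree-one arc-cosine kernel), and compares it with Monte-Carlo estimates at increasing width. The inputs, widths and function names are hypothetical choices for illustration.

```python
import math
import random


def analytic_relu_nngp(x, y):
    """Closed form of E[relu(w.x) relu(w.y)] for w ~ N(0, I) (arc-cosine kernel)."""
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cos_theta = sum(a * b for a, b in zip(x, y)) / (norm_x * norm_y)
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    angular = math.sin(theta) + (math.pi - theta) * math.cos(theta)
    return norm_x * norm_y * angular / (2.0 * math.pi)


def monte_carlo_relu_nngp(x, y, width, rng):
    """Average relu(w.x) * relu(w.y) over `width` random hidden units."""
    total = 0.0
    for _ in range(width):
        w = [rng.gauss(0.0, 1.0) for _ in x]
        pre_x = sum(wi * a for wi, a in zip(w, x))
        pre_y = sum(wi * b for wi, b in zip(w, y))
        total += max(pre_x, 0.0) * max(pre_y, 0.0)
    return total / width


rng = random.Random(0)
x, y = [1.0, 0.5, -0.3], [0.2, 1.0, 0.4]
exact = analytic_relu_nngp(x, y)

# Relative error of the finite-width estimate for a range of widths.
relative_errors = {
    width: abs(monte_carlo_relu_nngp(x, y, width, rng) - exact) / exact
    for width in (100, 1_000, 10_000, 100_000)
}
```

The entries of `relative_errors` are expected to shrink roughly like one over the square root of the width, although any single random draw need not decrease monotonically.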

Theorems & Definitions (36)

  • Theorem 1: Kernel recursions for group convolutional layers
  • proof
  • Theorem 2: Kernel recursions for the lifting layer
  • proof
  • Theorem 3: Kernel recursions for group pooling layer
  • proof
  • Corollary 3: Kernel recursions for nonlinearities
  • Theorem 4
  • proof
  • Theorem 5
  • ...and 26 more