Infinite Width Limits of Self Supervised Neural Networks

Maximilian Fleissner, Gautham Govind Anil, Debarghya Ghoshdastidar

TL;DR

This work establishes that, for self-supervised learning with the Barlow Twins loss, the neural tangent kernel of a two-layer network becomes constant in the infinite-width limit. The authors present a Grönwall-based proof showing that weight updates remain in a width-independent ball, enabling the NTK to stabilize; they then derive kernelized generalization bounds and connect them to finite-width networks via NTK approximation. Empirical validation on MNIST demonstrates near-constant NTK behavior, width-independent training dynamics, and convergence of finite networks to kernel-model representations, with extensions discussed for ReLU and deeper architectures. The results justify applying classical kernel theory to SSL and offer a principled bridge from kernel methods to practical SSL representation learning. Overall, the paper provides both rigorous NTK analysis for Barlow Twins and practical generalization guarantees linking kernel and neural-network perspectives.
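To make the notion of a "constant NTK" concrete, the following is a minimal sketch, not the authors' code. It assumes a scalar two-layer network $f(x) = \frac{1}{\sqrt{M}} \sum_m v_m \tanh(w_m \cdot x)$; the tanh activation, the $1/\sqrt{M}$ scaling, and all function names are illustrative assumptions. It computes the empirical NTK as an inner product of parameter gradients and measures how much it moves when the parameters are displaced by a fixed, width-independent distance. A random displacement is used as a stand-in for training, so this illustrates the mechanism only and does not reproduce the paper's experiments.

```python
import math
import random


def init_params(width, dim, rng):
    """Standard-normal weights for f(x) = (1/sqrt(M)) * sum_m v[m] * tanh(w[m] . x)."""
    w = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(width)]
    v = [rng.gauss(0.0, 1.0) for _ in range(width)]
    return w, v


def grad_features(x, w, v):
    """Flattened gradient of f(x) with respect to every parameter."""
    scale = 1.0 / math.sqrt(len(v))
    grads = []
    for w_m, v_m in zip(w, v):
        h = math.tanh(sum(wi * xi for wi, xi in zip(w_m, x)))
        # d f / d w[m] = scale * v[m] * tanh'(w[m] . x) * x
        grads.extend(scale * v_m * (1.0 - h * h) * xi for xi in x)
        # d f / d v[m] = scale * tanh(w[m] . x)
        grads.append(scale * h)
    return grads


def empirical_ntk(x1, x2, w, v):
    """NTK entry: inner product of the two parameter gradients."""
    g1 = grad_features(x1, w, v)
    g2 = grad_features(x2, w, v)
    return sum(a * b for a, b in zip(g1, g2))


def perturb(w, v, radius, rng):
    """Move the parameters along a random direction of Euclidean norm `radius`."""
    dw = [[rng.gauss(0.0, 1.0) for _ in row] for row in w]
    dv = [rng.gauss(0.0, 1.0) for _ in v]
    norm = math.sqrt(sum(d * d for row in dw for d in row) + sum(d * d for d in dv))
    s = radius / norm
    new_w = [[wi + s * di for wi, di in zip(row, drow)] for row, drow in zip(w, dw)]
    new_v = [vi + s * di for vi, di in zip(v, dv)]
    return new_w, new_v


def ntk_change(width, dim=3, radius=1.0, seed=0):
    """Absolute change of one NTK entry after a displacement of norm `radius`."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    x2 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    w, v = init_params(width, dim, rng)
    w2, v2 = perturb(w, v, radius, rng)
    return abs(empirical_ntk(x1, x2, w2, v2) - empirical_ntk(x1, x2, w, v))


if __name__ == "__main__":
    # The displacement radius is the same for every width.
    for width in (16, 256, 4096):
        print(width, ntk_change(width))
```

Running the script prints the kernel change for several widths at the same displacement radius, which is the quantity one would expect to shrink as the width grows if the kernel regime applies.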

Abstract

The neural tangent kernel (NTK) is a widely used tool in the theoretical analysis of deep learning, allowing us to view supervised deep neural networks through the lens of kernel regression. Recently, several works have investigated kernel models for self-supervised learning, hypothesizing that these also shed light on the behavior of wide neural networks by virtue of the NTK. However, it remains an open question to what extent this connection is mathematically sound: it is a common misconception that the kernel behavior of wide neural networks emerges irrespective of the loss function they are trained on. In this paper, we bridge the gap between the NTK and self-supervised learning, focusing on two-layer neural networks trained under the Barlow Twins loss. We prove that the NTK of Barlow Twins indeed becomes constant as the width of the network approaches infinity. Our analysis technique differs from previous works on the NTK and may be of independent interest. Overall, our work provides a first justification for the use of classic kernel theory to understand self-supervised learning of wide neural networks. Building on this result, we derive generalization error bounds for kernelized Barlow Twins and connect them to neural networks of finite width.
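For readers unfamiliar with the Barlow Twins objective, here is a minimal sketch of its general form, assuming two lists of embeddings `z_a` and `z_b` (one row per sample, one for each augmented view). The loss pushes the cross-correlation matrix of the two views toward the identity. The unnormalized cross-moment matrix, the trade-off weight `lam`, and the function names are assumptions made for illustration; the exact variant analyzed in the paper may differ.

```python
def cross_correlation(z_a, z_b):
    """C[i][j] = (1/N) * sum_n z_a[n][i] * z_b[n][j] over the N samples."""
    n = len(z_a)
    k = len(z_a[0])
    return [
        [sum(z_a[s][i] * z_b[s][j] for s in range(n)) / n for j in range(k)]
        for i in range(k)
    ]


def barlow_twins_loss(z_a, z_b, lam=1.0):
    """Penalize diagonal entries away from 1 and off-diagonal entries away from 0."""
    c = cross_correlation(z_a, z_b)
    k = len(c)
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(k))
    off_diag = sum(c[i][j] ** 2 for i in range(k) for j in range(k) if i != j)
    return on_diag + lam * off_diag


if __name__ == "__main__":
    # Identical views with decorrelated, unit-second-moment features: C is the identity.
    decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
    print(barlow_twins_loss(decorrelated, decorrelated))  # 0.0

    # Identical views whose two features are copies of each other: C is all ones.
    redundant = [[1.0, 1.0], [-1.0, -1.0]]
    print(barlow_twins_loss(redundant, redundant, lam=0.5))  # 1.0
```

The second example shows the role of the off-diagonal term: the views agree perfectly, yet the loss is nonzero because the two embedding dimensions carry the same information.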

Paper Structure

This paper contains 38 sections, 14 theorems, 153 equations, 3 figures, 1 table.

Key Result

Theorem 3.1

Consider an initial parameter $\theta_0 \in \mathbb{R}^p$ and a ball $B(\theta_0,R)$ around $\theta_0$ of radius $R>0$. Suppose that $\forall k \in [K]$, all inputs $a$ and all $\theta \in B(\theta_0,R)$, the Hessian of $f_k(a;\theta)$ in parameter space satisfies and that $\| \nabla_\theta f_k(a;\theta) \|_2 \le c_0$. Then, the change in the neural tangent kernel is bounded by for all $\theta \
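The displayed bounds of the theorem are cut off in the statement above, so the following is only a sketch of the standard argument by which assumptions of this shape control the kernel. It uses a placeholder: $\epsilon$ stands for a spectral-norm bound on the Hessian (a hypothetical symbol, not necessarily the paper's constant), and the NTK entry is taken to be $K_{kl}(a,b;\theta) = \langle \nabla_\theta f_k(a;\theta), \nabla_\theta f_l(b;\theta) \rangle$. Because the ball is convex, the mean value inequality gives $\| \nabla_\theta f_k(a;\theta) - \nabla_\theta f_k(a;\theta_0) \|_2 \le \epsilon \| \theta - \theta_0 \|_2 \le \epsilon R$ for every $\theta$ in the ball, and therefore

$$
\begin{aligned}
| K_{kl}(a,b;\theta) - K_{kl}(a,b;\theta_0) |
&\le | \langle \nabla_\theta f_k(a;\theta) - \nabla_\theta f_k(a;\theta_0), \nabla_\theta f_l(b;\theta) \rangle | \\
&\quad + | \langle \nabla_\theta f_k(a;\theta_0), \nabla_\theta f_l(b;\theta) - \nabla_\theta f_l(b;\theta_0) \rangle | \\
&\le \epsilon R \, c_0 + c_0 \, \epsilon R = 2 c_0 \epsilon R .
\end{aligned}
$$

Under these assumptions, a Hessian bound $\epsilon$ that shrinks with the width, combined with a radius $R$ that does not grow with it, forces the kernel change to vanish. The constants in the actual theorem may differ from this sketch.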

Figures (3)

  • Figure 1: For a fixed sample size $N$, we plot different quantities for varying network width $M$. We then vary $N$ and plot: (a) NTK change until convergence, (b) training epochs until convergence, (c) squared norm of the difference between the representations of the neural network and the corresponding kernel model.
  • Figure 2: For a fixed sample size $N$, we plot different quantities for varying network width $M$. We then vary $N$ and plot: (a) NTK change until convergence, (b) training epochs until convergence, (c) squared norm of the difference between the representations of the neural network and the corresponding kernel model.
  • Figure 3: For a fixed sample size $N$, we plot different quantities for varying network width $M$. We then vary $N$ and plot: (a) NTK change until convergence, (b) training epochs until convergence, (c) squared norm of the difference between the representations of the neural network and the corresponding kernel model.

Theorems & Definitions (28)

  • Theorem 3.1
  • Theorem 4.1
  • Lemma 4.2
  • Definition 4.3
  • Theorem 4.4
  • Corollary 4.5
  • Remark 4.6
  • Theorem 5.1
  • Lemma 5.2
  • Theorem 5.3
  • ...and 18 more