An Infinite-Width Analysis on the Jacobian-Regularised Training of a Neural Network

Taeyoung Kim, Hongseok Yang

TL;DR

This work extends the infinite-width analysis of neural networks to the input-output Jacobian. It shows that an MLP and its Jacobian jointly converge to a Gaussian process (GP) at initialisation, with covariance given by a Jacobian NNGP kernel. It then characterises training under a Jacobian-regularised objective via a Jacobian NTK (JNTK): in the infinite-width limit the JNTK is deterministic, learning is governed by a linear ODE, and the trained network converges to a kernel regressor defined by the JNTK. Experiments on wide finite networks check the GP convergence, the determinism of the JNTK, and its constancy during training, and an analysis of the kernel-regression solution gives insight into Jacobian regularisation and robustness. Overall, the results connect Jacobian regularisation in wide neural networks to a kernel-method viewpoint.
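
As a rough illustration of the linear-ODE viewpoint, the sketch below integrates a generic kernel gradient flow $\dot{u}_t = -\Theta (u_t - y)$ with forward Euler steps. The 2×2 matrix `theta`, the targets, and the step size are made-up values chosen for illustration. The paper's JNTK additionally couples function values with Jacobian entries and involves the regularisation coefficient, neither of which is reproduced here.

```python
def euler_kernel_flow(theta, u0, y, step, num_steps):
    """Integrate du/dt = -theta @ (u - y) with forward Euler steps."""
    u = list(u0)
    n = len(u)
    for _ in range(num_steps):
        residual = [u[i] - y[i] for i in range(n)]
        # Kernel-weighted residual: this is the right-hand side of the linear ODE.
        grad = [sum(theta[i][j] * residual[j] for j in range(n)) for i in range(n)]
        u = [u[i] - step * grad[i] for i in range(n)]
    return u


# Hypothetical positive-definite kernel matrix on two training points.
theta = [[2.0, 0.5], [0.5, 1.0]]
u_final = euler_kernel_flow(theta, u0=[0.0, 0.0], y=[1.0, -1.0], step=0.1, num_steps=500)
```

With a positive-definite kernel and a small enough step, the predictions on the training points approach the targets, which is the sense in which such a flow ends at a kernel-regression solution.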

Abstract

The recent theoretical analysis of deep neural networks in their infinite-width limits has deepened our understanding of initialisation, feature learning, and training of those networks, and brought new practical techniques for finding appropriate hyperparameters, learning network weights, and performing inference. In this paper, we broaden this line of research by showing that this infinite-width analysis can be extended to the Jacobian of a deep neural network. We show that a multilayer perceptron (MLP) and its Jacobian at initialisation jointly converge to a Gaussian process (GP) as the widths of the MLP's hidden layers go to infinity, and we characterise this GP. We also prove that in the infinite-width limit, the evolution of the MLP under so-called robust training (i.e., training with a regulariser on the Jacobian) is described by a linear first-order ordinary differential equation that is determined by a variant of the Neural Tangent Kernel. We experimentally show the relevance of our theoretical claims to wide finite networks, and empirically analyse the properties of the kernel regression solution to gain insight into Jacobian regularisation.
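
To make "training with a regulariser on the Jacobian" concrete, here is a minimal sketch of one possible objective: squared error plus a penalty on the squared norm of the input-output Jacobian. The exact form and scaling of the paper's objective are not given in this summary, so the factor 1/2, the coefficient name `lam`, and the finite-difference Jacobian are all assumptions. In practice one would typically use automatic differentiation instead.

```python
def numerical_jacobian(f, x, eps=1e-5):
    """Central-difference estimate of the gradient of a scalar function f at x."""
    grads = []
    for i in range(len(x)):
        x_plus = list(x)
        x_plus[i] += eps
        x_minus = list(x)
        x_minus[i] -= eps
        grads.append((f(x_plus) - f(x_minus)) / (2.0 * eps))
    return grads


def jacobian_regularised_loss(f, xs, ys, lam):
    """Mean over the data of 0.5 * squared error + 0.5 * lam * squared Jacobian norm."""
    total = 0.0
    for x, y in zip(xs, ys):
        jac = numerical_jacobian(f, x)
        total += 0.5 * (f(x) - y) ** 2 + 0.5 * lam * sum(g * g for g in jac)
    return total / len(xs)


def linear_model(x):
    """Toy model f(x) = 2*x0 - x1, whose Jacobian is (2, -1) everywhere."""
    return 2.0 * x[0] - x[1]


loss = jacobian_regularised_loss(linear_model, xs=[[1.0, 1.0]], ys=[1.0], lam=0.1)
```

In this toy case the prediction matches the target, so the loss comes entirely from the penalty term, 0.5 · 0.1 · (2² + 1²) = 0.25 up to finite-difference error.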

Paper Structure (23 sections, 37 theorems, 296 equations, 13 figures)

Key Result

Theorem 3.1

Suppose that the activation assumption and the dataset assumption hold. As $d$ goes to infinity, the function $x \longmapsto ( f_d(x), J( f_d)(x)_0, \ldots, J( f_d)(x)_{d_0-1})^\intercal$ from $\mathbb{R}^{d_0}$ to $\mathbb{R}^{d_0+1}$ converges weakly in finite marginals [...] where $\mathbb{I}_{[{-}]}$ is the indicator function, the subscript ${[-]}_\alpha$ denotes the [...]
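
The object in this theorem, the stacked vector of the output and its $d_0$ input derivatives, can be sampled at initialisation with a few lines of code. The sketch below does so for a one-hidden-layer tanh MLP with standard-normal weights and $1/\sqrt{\text{width}}$ output scaling, then forms an empirical second-moment matrix over random initialisations. The architecture, activation, scaling, and sample sizes are assumptions made for illustration, not the paper's setup.

```python
import math
import random


def sample_output_and_jacobian(x, width, rng):
    """Draw one random one-hidden-layer tanh MLP; return [f(x), df/dx_0, ..., df/dx_{d0-1}]."""
    d0 = len(x)
    out = 0.0
    jac = [0.0] * d0
    for _ in range(width):
        w = [rng.gauss(0.0, 1.0) for _ in range(d0)]  # input-to-hidden weights of one unit
        v = rng.gauss(0.0, 1.0)                        # hidden-to-output weight of that unit
        pre = sum(wi * xi for wi, xi in zip(w, x)) / math.sqrt(d0)
        act = math.tanh(pre)
        out += v * act
        for i in range(d0):
            # Chain rule: tanh'(pre) = 1 - tanh(pre)^2, and d(pre)/dx_i = w_i / sqrt(d0).
            jac[i] += v * (1.0 - act * act) * w[i] / math.sqrt(d0)
    scale = 1.0 / math.sqrt(width)
    return [out * scale] + [g * scale for g in jac]


def empirical_second_moment(samples):
    """Uncentred second-moment matrix; the sampled vector has zero mean by symmetry of v."""
    n, k = len(samples), len(samples[0])
    return [[sum(s[a] * s[b] for s in samples) / n for b in range(k)] for a in range(k)]


rng = random.Random(0)
x = [0.5, -1.0]
samples = [sample_output_and_jacobian(x, width=64, rng=rng) for _ in range(2000)]
cov = empirical_second_moment(samples)
```

Here `cov` is a $(d_0+1) \times (d_0+1)$ matrix whose first row and column relate the output to its input derivatives. Repeating the experiment at several widths is one way to probe how the joint distribution behaves as the width grows.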

Figures (13)

  • Figure 1: $\max$-norm distance between the Jacobian NNGP kernel and the estimate of its finite counterpart. The $x$-axis represents the width of an MLP, and the $y$-axis the $\max$-norm distance.
  • Figure 2: $\max$-norm distance between the $(1/\kappa^2)$-scaled finite JNTK $(1/\kappa^2)\Theta_{d,\theta_0}$ and the similarly-scaled limiting JNTK $\Theta$ at initialisation. The $x$-axis represents the width of an MLP, and the $y$-axis the $\max$-norm distance.
  • Figure 3: $\max$-norm distance between the $(1/\kappa^2)$-scaled finite JNTK $(1/\kappa^2)\Theta_{d,\theta_t}$ at the training step $t$, and the similarly-scaled limiting JNTK $\Theta$ at initialisation. The $x$-axis represents the width of an MLP, and the $y$-axis the $\max$-norm distance.
  • Figure 4: Pairplot of three quantities for the eigenfeatures of robust training: test accuracy, test accuracy after 0.01 perturbation, and test accuracy after 0.1 perturbation. The colours denote the coefficient of Jacobian regularisation.
  • Figure 5: Pairplot of three quantities for the eigenfeatures of standard training: test accuracy, test accuracy after 0.01 perturbation, and test accuracy after 0.1 perturbation. The ellipses show the 4$\sigma$ confidence regions of multivariate Gaussian distributions fitted to the points, to make the correlation clearly visible.
  • ...and 8 more figures
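
The $\max$-norm distance reported in Figures 1 to 3 can be computed as below, assuming it means the largest absolute entrywise difference between two kernel matrices (this reading of "$\max$-norm" is an assumption). The two matrices are made-up values, not results from the paper.

```python
def max_norm_distance(kernel_a, kernel_b):
    """Largest absolute entrywise difference between two equally-shaped matrices."""
    return max(
        abs(a - b)
        for row_a, row_b in zip(kernel_a, kernel_b)
        for a, b in zip(row_a, row_b)
    )


limit_kernel = [[1.0, 0.2], [0.2, 1.0]]       # hypothetical limiting kernel values
finite_estimate = [[0.9, 0.25], [0.25, 1.1]]  # hypothetical finite-width estimate
distance = max_norm_distance(limit_kernel, finite_estimate)
```

Plotting `distance` against the width, with the finite estimate recomputed at each width, gives a curve of the kind the captions describe.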

Theorems & Definitions (69)

  • Theorem 3.1: GP Convergence at Initialisation
  • Theorem 3.2
  • Definition 4.1: Finite Jacobian NTK
  • Lemma 4.2
  • Theorem 4.3: Convergence of Finite JNTK at Initialisation
  • Theorem 4.5: Constancy of Finite JNTK during Training
  • Theorem 4.6: MLPs Learnt by Robust Training
  • Theorem 4.1: Master Theorem
  • Corollary 4.2: GP Convergence
  • Remark 5.1
  • ...and 59 more