Controlling the Inductive Bias of Wide Neural Networks by Modifying the Kernel's Spectrum

Amnon Geifman, Daniel Barzilai, Ronen Basri, Meirav Galun

TL;DR

This paper introduces Modified Spectrum Kernels (MSKs), a novel family of constructed kernels that approximate kernels with desired eigenvalues for which no closed form is known, together with a preconditioned gradient descent method that alters the trajectory of GD.

Abstract

Wide neural networks are biased towards learning certain functions, influencing both the rate of convergence of gradient descent (GD) and the functions that are reachable with GD in finite training time. As such, there is a great need for methods that can modify this bias according to the task at hand. To that end, we introduce Modified Spectrum Kernels (MSKs), a novel family of constructed kernels that can be used to approximate kernels with desired eigenvalues for which no closed form is known. We leverage the duality between wide neural networks and Neural Tangent Kernels and propose a preconditioned gradient descent method, which alters the trajectory of GD. As a result, this allows for a polynomial and, in some cases, exponential training speedup without changing the final solution. Our method is both computationally efficient and simple to implement.
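The claim that preconditioning speeds up training without changing the final solution can be illustrated in the kernel regime, where gradient descent on the training predictions $u$ follows $u \leftarrow u - \eta K (u - y)$. The following is a minimal NumPy sketch under stated assumptions, not the paper's algorithm: the preconditioner form $S = V \operatorname{diag}(g_i/\lambda_i) V^\top$, the flat target spectrum `g`, the Laplace kernel standing in for the NTK, and the helper names `laplace_kernel` and `fit` are all chosen here for illustration.

```python
import numpy as np

def laplace_kernel(x, z, c=1.0):
    """Laplace kernel exp(-c * |x - z|) for 1-D inputs (stand-in for an NTK)."""
    return np.exp(-c * np.abs(x[:, None] - z[None, :]))

def fit(K, y, S, lr, tol=1e-3, max_iter=100_000):
    """Iterate u <- u - lr * K S (u - y); return (steps taken, final u)."""
    u = np.zeros_like(y)
    for step in range(max_iter):
        r = u - y
        if np.linalg.norm(r) <= tol * np.linalg.norm(y):
            return step, u
        u = u - lr * (K @ (S @ r))
    return max_iter, u

x = np.linspace(-1.0, 1.0, 30)
y = np.sin(8.0 * x)                      # a moderately high-frequency target
K = laplace_kernel(x, x)

lam, V = np.linalg.eigh(K)               # eigenvalues in ascending order
g = np.full_like(lam, lam[-1])           # hypothetical target spectrum: flat
S = (V * (g / lam)) @ V.T                # assumed form: V diag(g / lam) V^T
lr = 1.0 / lam[-1]                       # stable step size for both runs

gd_iters, u_gd = fit(K, y, np.eye(len(y)), lr)   # plain GD (S = identity)
pgd_iters, u_pgd = fit(K, y, S, lr)              # preconditioned GD

print("GD steps:", gd_iters, "| PGD steps:", pgd_iters)
```

Because $KS = V \operatorname{diag}(g_i) V^\top$, each eigendirection of the residual contracts at a rate set by $g_i$ rather than $\lambda_i$, while both runs drive the residual to zero and therefore reach the same fit on the training set.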

Paper Structure

This paper contains 12 sections, 12 theorems, 75 equations, 2 figures, 1 algorithm.

Key Result

Theorem 3.2

Let $g, \boldsymbol{k}, \boldsymbol{k}_g$ be as in Definition 3.1 and assume that $\forall \mathbf{x}\in \mathcal{X}, \left|\Phi_i(\mathbf{x})\right|\leq M$. Let $K, K_g$ be the corresponding kernel matrices on i.i.d. samples $\mathbf{x}_1,\ldots,\mathbf{x}_n\in \mathcal{X}$. Define the kernel matrix [...] where a.s. stands for almost surely.
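As a rough illustration of what modifying a kernel matrix's spectrum can look like at the matrix level, the sketch below keeps the eigenvectors of a kernel matrix $K$ and replaces its eigenvalues with $g(\lambda_i)$. It is an assumption-laden example rather than the construction in the theorem: the Laplace kernel, the choice $g = \sqrt{\cdot}$ (which gives a slower eigenvalue decay), and the helper `modify_spectrum` are all hypothetical.

```python
import numpy as np

def laplace_kernel(x, z, c=1.0):
    """Laplace kernel exp(-c * |x - z|) for 1-D inputs."""
    return np.exp(-c * np.abs(x[:, None] - z[None, :]))

def modify_spectrum(K, g):
    """Return V diag(g(lam)) V^T, where K = V diag(lam) V^T."""
    lam, V = np.linalg.eigh(K)
    # Scaling column j of V by g(lam_j) is the same as V @ diag(g(lam)).
    return (V * g(lam)) @ V.T

x = np.linspace(-1.0, 1.0, 20)
K = laplace_kernel(x, x)
K_g = modify_spectrum(K, np.sqrt)        # hypothetical g: square root

# The modified matrix shares eigenvectors with K and has eigenvalues g(lam).
lam = np.linalg.eigvalsh(K)
lam_g = np.linalg.eigvalsh(K_g)
print("largest/smallest eigenvalue, K:  ", lam[-1] / lam[0])
print("largest/smallest eigenvalue, K_g:", lam_g[-1] / lam_g[0])
```

With $g = \sqrt{\cdot}$, the modified matrix is the matrix square root of $K$, so `K_g @ K_g` recovers `K` up to floating-point error, which gives a quick sanity check on the construction.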

Figures (2)

  • Figure 1: Numerical validation. Top: The number of iterations required to learn different Fourier components as a function of frequency. Standard SGD is shown in blue, and (stochastic) PGD with a preconditioner derived from the NTK matrix $K$ and the empirical NTK $K_t$ respectively are shown in green and orange. Bottom: Training curves with four different frequencies ($k=5,7,10,12$). The graphs show the MSE loss as a function of iteration number with stochastic GD and PGD.
  • Figure 2: The effect of various MSKs on overfitting to noise. Choosing an MSK with a slower eigenvalue decay helps prevent overfitting. Starting from a Laplace kernel, for each choice of $g$, we perform unregularized kernel regression (with $\gamma=0$) using an MSK as defined in Definition 3.1. The target function is identically $0$, with i.i.d. Gaussian noise added to the training set ($y_i\sim \mathcal{N}(0,1)$). The $x$ axis denotes the number of samples, and the $y$ axis the MSE on the test set (for which the target is $0$).

Theorems & Definitions (20)

  • Definition 3.1
  • Theorem 3.2
  • Theorem 4.1
  • Lemma 4.2
  • Lemma 4.3
  • Corollary 4.4
  • Lemma A.1
  • proof
  • Lemma A.2
  • proof
  • ...and 10 more