On Interpolation Formulas Describing Neural Network Generalization

Jin Guo, Roy Y. He, Jean-Michel Morel

Abstract

In 2020 Domingos introduced an interpolation formula valid for "every model trained by gradient descent". He concluded that such models behave approximately as kernel machines. In this work, we extend the Domingos formula to stochastic training. We introduce a stochastic gradient kernel that extends the deterministic version via a continuous-time diffusion approximation. We prove stochastic Domingos theorems and show that the expected network output admits a kernel-machine representation with optimizer-specific weighting. It reveals that training samples contribute through loss-dependent weights and gradient alignment along the training trajectory. We then link the generalization error to the null space of the integral operator induced by the stochastic gradient kernel. The same path-kernel viewpoint provides a unified interpretation of diffusion models and GANs: diffusion induces stage-wise, noise-localized corrections, whereas GANs induce distribution-guided corrections shaped by discriminator geometry. We visualize the evolution of implicit kernels during optimization and quantify out-of-distribution behaviors through a series of numerical experiments. Our results support a feature-space memory view of learning: training stores data-dependent information in an evolving tangent feature geometry, and predictions at test time arise from kernel-weighted retrieval and aggregation of these stored features, with generalization governed by alignment between test points and the learned feature memory.

Paper Structure (26 sections, 13 theorems, 99 equations, 10 figures)

Key Result

Theorem 2.2

Let $f(\cdot,\mathbf{\Theta})$ be a differentiable model trained on $\{(x_n,y_n^*)\}_{n=1}^N$ via gradient flow on the empirical loss $L(\mathbf{\Theta}) = \frac{1}{N}\sum_{n=1}^N \ell(f(x_n,\mathbf{\Theta}), y_n^*)$. Then for any input $x \in \mathbb{R}^p$, the model output at time …
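
The statement above is cut off in this extract. For orientation, here is a sketch of the standard argument behind interpolation formulas of this type. It is not a reproduction of the paper's exact statement. It assumes a scalar-valued model, gradient flow written without a learning-rate factor, $\ell'$ denoting the derivative of $\ell$ in its first argument, and the assumed notation $K_t(x,x') = \nabla_{\mathbf{\Theta}} f(x,\mathbf{\Theta}(t))^\top \nabla_{\mathbf{\Theta}} f(x',\mathbf{\Theta}(t))$ for the gradient kernel at time $t$.

```latex
\begin{align*}
\frac{d\mathbf{\Theta}}{dt}
  &= -\nabla_{\mathbf{\Theta}} L(\mathbf{\Theta}(t))
   = -\frac{1}{N}\sum_{n=1}^N \ell'\big(f(x_n,\mathbf{\Theta}(t)),\, y_n^*\big)\,
      \nabla_{\mathbf{\Theta}} f(x_n,\mathbf{\Theta}(t)), \\
\frac{d}{dt} f(x,\mathbf{\Theta}(t))
  &= \nabla_{\mathbf{\Theta}} f(x,\mathbf{\Theta}(t))^\top \frac{d\mathbf{\Theta}}{dt}
   = -\frac{1}{N}\sum_{n=1}^N \ell'\big(f(x_n,\mathbf{\Theta}(t)),\, y_n^*\big)\, K_t(x,x_n), \\
f(x,\mathbf{\Theta}(T))
  &= f(x,\mathbf{\Theta}(0))
   - \frac{1}{N}\sum_{n=1}^N \int_0^T \ell'\big(f(x_n,\mathbf{\Theta}(t)),\, y_n^*\big)\, K_t(x,x_n)\, dt .
\end{align*}
```

Read this way, each training sample contributes through a loss-dependent weight $\ell'$ multiplied by its gradient alignment with $x$, accumulated along the training trajectory, which matches the description in the abstract.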

Figures (10)

  • Figure 1: Evolving neighborhoods around anchor points (marked as "x"). (a) Binary dataset with class $y=1$ on the circle $x_1^2+x_2^2=1$ and class $y=-1$ inside the disk $x_1^2+x_2^2\le 0.8$. (b) SGD decision regions after training. In (c–f), black crosses denote fixed anchor inputs (coordinates listed in the legend). For each anchor, we highlight its $100$ nearest training points under the specified notion of similarity. (c) Neighbors in Euclidean input space, which are largely class-mixed. (d–f) Neighbors under the normalized gradient kernel $\widehat{K}_t$ at initialization, at the 200th epoch, and at the last epoch. Training progressively sharpens these kernel neighborhoods toward label-homogeneous sets.
  • Figure 2: Prediction and gradient kernel evolution. (a) Learned prediction versus ground truth for the sine function $\sin x$. (c) Learned prediction versus ground truth for the square-wave target. (b,d) Normalized gradient kernel matrices at the final iteration (1000 steps) for $\sin x$ and the square wave, respectively. (e–f) MNIST and (g–h) CIFAR-10: normalized gradient kernel matrices at initialization (e,g) and after convergence (f,h). For MNIST/CIFAR-10, we use 10 images per class (100 samples total), ordered by class.
  • Figure 3: SVM classification in input space versus tangent-feature space. In (a,b), we fit a linear SVM using the learned tangent features $\phi(x_n)=\nabla_\mathbf{\Theta} f(x_n,\mathbf{\Theta}_T)$ as inputs (each sample $x_n$ is represented by $\phi(x_n)$, not by its coordinates). In (c,d), we fit an SVM with a Gaussian RBF kernel on the raw inputs. Panels (a,c) use the isotropic circle task $x: x_1^2+x_2^2\le 0.9$ (label $-1$) versus $x: x_1^2+x_2^2=1$ (label $+1$); panels (b,d) use the anisotropic ellipse variant $x: 100x_1^2+x_2^2\le 0.9$ (label $-1$) versus $x: 100x_1^2+x_2^2=1$ (label $+1$).
  • Figure 4: UMAP visualization of MNIST digits in the tangent feature space $\{\nabla_\mathbf{\Theta} f(x_n,\mathbf{\Theta}_k)\}_{n=1}^N$. Left: At initialization ($0$-th epoch), digits $0$–$4$ are not clearly separated, indicating limited discriminative structure. Middle: After training on digits $0$–$4$ ($200$-th epoch), the tangent-space representations of these classes become well clustered and separable. Right: Tangent-feature visualization of all digits: while trained digits ($0$–$4$) remain well separated, unseen digits ($5$–$9$) fail to form distinct clusters.
  • Figure 5: Normalized gradient kernel for out-of-domain printed digits. (a,b) Normalized gradient kernel between MNIST digits $k$ (blocks at indices $2k\times 10$ to $(2k+1)\times 10$) and printed digits $k$ (blocks at indices $(2k+1)\times 10$ to $(2k+2)\times 10$), for $k=0,1,\ldots,9$, at initialization (a) and after training ($200$-th epoch) (b). Samples are ordered by class, with 10 images per class for MNIST digits (0–9) and 10 printed images per digit; the block structure highlights within-digit and cross-digit similarities between the two domains. (c) Confusion matrix for classifying printed digits using the network trained on MNIST.
  • ...and 5 more figures
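
Several captions above refer to tangent features $\phi(x)=\nabla_\mathbf{\Theta} f(x,\mathbf{\Theta})$ and to a normalized gradient kernel $\widehat{K}_t$. The sketch below shows one way such a matrix could be computed for a toy network. The one-hidden-layer tanh model, the helper names, and the cosine-style normalization $K(x,x')/\sqrt{K(x,x)K(x',x')}$ are all assumptions made for illustration; the extract does not state the paper's architecture or its exact normalization.

```python
import numpy as np

def init_params(rng, d_in=2, width=16):
    # Hypothetical toy model: f(x) = v . tanh(W x + b)
    return {
        "W": rng.normal(size=(width, d_in)) / np.sqrt(d_in),
        "b": np.zeros(width),
        "v": rng.normal(size=width) / np.sqrt(width),
    }

def tangent_features(params, X):
    """Return one row per input: phi(x) = grad_Theta f(x, Theta), ordered [W, b, v]."""
    H = np.tanh(X @ params["W"].T + params["b"])   # hidden activations, (n, width)
    dZ = (1.0 - H**2) * params["v"]                # df/d(pre-activation) = df/db
    dW = dZ[:, :, None] * X[:, None, :]            # df/dW, (n, width, d_in)
    return np.concatenate([dW.reshape(len(X), -1), dZ, H], axis=1)

def normalized_gradient_kernel(params, X):
    """Gram matrix of tangent features, rescaled to unit diagonal (assumed normalization)."""
    Phi = tangent_features(params, X)
    K = Phi @ Phi.T
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
K_hat = normalized_gradient_kernel(init_params(rng), X)
```

Ranking the rows of `K_hat` for a fixed anchor index gives that anchor's nearest neighbors under the kernel, which is the kind of neighborhood the Figure 1 caption describes. Recomputing the matrix at different checkpoints would trace how it evolves during training.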

Theorems & Definitions (26)

  • Definition 2.1: Gradient kernel and path kernel
  • Theorem 2.2: Domingos' Theorem [domingos2020every]
  • proof
  • Definition 3.1: $\alpha$-th order weak approximation [li2019stochastic]
  • Lemma 3.1: First-order weak approximation of SGD [li2019stochastic]
  • Definition 3.2: Stochastic gradient kernel
  • Definition 3.3: Weighted stochastic gradient kernel
  • Theorem 3.4: Stochastic Domingos' Theorem (SGD)
  • proof
  • Theorem 3.5: Stochastic Domingos' Theorem (SGDM)
  • ...and 16 more
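
As a rough numerical illustration of the path-kernel idea behind Definition 2.1 and Theorem 2.2, the sketch below runs small-step full-batch gradient descent as a stand-in for gradient flow and compares the change in the model output at a test point with the accumulated kernel-weighted sum. The toy tanh network, the squared loss, the step size, and all variable names are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

D_IN, WIDTH = 2, 8

def unpack(theta):
    # Flat parameter vector ordered as [W, b, v]
    W = theta[: WIDTH * D_IN].reshape(WIDTH, D_IN)
    b = theta[WIDTH * D_IN : WIDTH * D_IN + WIDTH]
    v = theta[WIDTH * D_IN + WIDTH :]
    return W, b, v

def model(theta, X):
    W, b, v = unpack(theta)
    return np.tanh(X @ W.T + b) @ v

def tangent_features(theta, X):
    # Rows are grad_Theta f(x, Theta), in the same [W, b, v] order as theta
    W, b, v = unpack(theta)
    H = np.tanh(X @ W.T + b)
    dZ = (1.0 - H**2) * v
    dW = (dZ[:, :, None] * X[:, None, :]).reshape(len(X), -1)
    return np.concatenate([dW, dZ, H], axis=1)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(8, D_IN))
y_train = np.sin(X_train[:, 0])
x_test = rng.normal(size=(1, D_IN))
theta = 0.5 * rng.normal(size=WIDTH * D_IN + 2 * WIDTH)

eta, steps = 1e-4, 5000          # small step so the discrete path tracks gradient flow
f0 = model(theta, x_test)[0]
kernel_sum = 0.0
for _ in range(steps):
    Phi_train = tangent_features(theta, X_train)
    phi_test = tangent_features(theta, x_test)[0]
    residual = model(theta, X_train) - y_train   # ell' for ell = 0.5 * (f - y)^2
    K_row = Phi_train @ phi_test                 # K_t(x_test, x_n) for each n
    kernel_sum += eta * np.mean(residual * K_row)
    theta = theta - eta * Phi_train.T @ residual / len(X_train)

lhs = model(theta, x_test)[0] - f0   # actual change in the output
rhs = -kernel_sum                    # kernel-weighted prediction of that change
print(lhs, rhs)
```

The two printed numbers should agree up to a discretization error that shrinks with `eta`. The example only covers the deterministic, full-batch case; the stochastic variants listed above (SGD, SGDM) are not modeled here.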