Zero Generalization Error Theorem for Random Interpolators via Algebraic Geometry

Naoki Yoshida, Isao Ishikawa, Masaaki Imaizumi

TL;DR

This work provides a model-based theory for zero generalization error of interpolators in a teacher–student regression setting by leveraging real analytic sets to capture the geometry of interpolator and teacher-equivalent parameter spaces. The central result bounds the strong sample complexity by $k(\widehat{\Theta}_n) \le d_\Theta - d_{\bar{\Theta}} + 1$, implying that zero generalization error can be achieved with finite data independent of parameter count, provided the TES dimension is large. The authors instantiate the theory for deep linear and fully connected deep networks, deriving explicit TES-based bounds such as $k(\widehat{\Theta}_n) \le d^* + 1$ and $k(\widehat{\Theta}_n) \le \sum_{\ell=1}^L m_\ell^*(m_{\ell-1}+1) + 1$, respectively. Empirical results on DLNNs and MNIST corroborate the theoretical predictions using near-interpolators, demonstrating consistent data-threshold behavior even in practical, overparameterized settings.
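The three bounds quoted above are simple arithmetic, so a small sketch can make their scale concrete. The function names below are hypothetical. The reading of $m_\ell^*$ as the teacher's width at layer $\ell$ and $m_{\ell-1}$ as the student's width at layer $\ell-1$ is an assumption, since this summary does not define the symbols. The example widths are illustrative only and are not taken from the paper's experiments.

```python
from typing import Sequence


def general_bound(d_theta: int, d_tes: int) -> int:
    """Upper bound d_Theta - d_TES + 1 on the strong sample complexity."""
    return d_theta - d_tes + 1


def dlnn_bound(d_star: int) -> int:
    """Upper bound d^* + 1 quoted for deep linear networks."""
    return d_star + 1


def fcdnn_bound(m_star: Sequence[int], m_prev: Sequence[int]) -> int:
    """Upper bound sum_l m_l^* (m_{l-1} + 1) + 1 quoted for fully connected networks.

    m_star = [m_1^*, ..., m_L^*] and m_prev = [m_0, ..., m_{L-1}];
    interpreting these as teacher and student widths is an assumption.
    """
    if len(m_star) != len(m_prev):
        raise ValueError("m_star and m_prev must both have length L")
    return sum(ms * (mp + 1) for ms, mp in zip(m_star, m_prev)) + 1


# Illustrative 2-layer example: 3 * (10 + 1) + 1 * (50 + 1) + 1 = 85
example = fcdnn_bound(m_star=[3, 1], m_prev=[10, 50])
```

Under this reading, the bound depends on the student's parameter count only through the widths $m_{\ell-1}$ that multiply the starred widths, which is consistent with the claim that the threshold need not grow with the total number of parameters.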

Abstract

We theoretically demonstrate that the generalization error of interpolators for machine learning models under teacher-student settings becomes 0 once the number of training samples exceeds a certain threshold. Understanding the high generalization ability of large-scale models such as deep neural networks (DNNs) remains one of the central open problems in machine learning theory. While recent theoretical studies have attributed this phenomenon to the implicit bias of stochastic gradient descent (SGD) toward well-generalizing solutions, empirical evidence indicates that it primarily stems from properties of the model itself. Specifically, even randomly sampled interpolators, which are parameters that achieve zero training error, have been observed to generalize effectively. In this study, under a teacher-student framework, we prove that the generalization error of randomly sampled interpolators becomes exactly zero once the number of training samples exceeds a threshold determined by the geometric structure of the interpolator set in parameter space. As a proof technique, we leverage tools from algebraic geometry to mathematically characterize this geometric structure.

Paper Structure

This paper contains 50 sections, 16 theorems, 57 equations, 4 figures, 1 algorithm.

Key Result

Theorem 1

The following holds with probability $1$:

Figures (4)

  • Figure 1: Illustration of the IPS $\widehat{\Theta}_n$ approaching the TES $\bar{\Theta}$. As $n$ increases, fewer parameters $\theta$ remain interpolators, i.e. satisfy $\ell(y_i, f(x_i;\theta)) = 0$ for all $i = 1,...,n$, so $\widehat{\Theta}_n$ converges to $\bar{\Theta}$. Note that while $\widehat{\Theta}_n$ and $\bar{\Theta}$ are plotted on the same plane in the right panel, $\widehat{\Theta}_n$ is actually higher dimensional than $\bar{\Theta}$.
  • Figure 2: Test losses of random near interpolators on 2-layer DLNN (left), 4-layer DLNN (middle), and 6-layer DLNN (right). The vertical axis represents the test loss, while the horizontal axis corresponds to the number of training data. The error bars indicate the standard deviation over $1000$ trials for each training sample size. The red vertical line is the theoretical upper bound of the strong sample complexity in Theorem \ref{thm: DLNN}.
  • Figure 3: Test losses of random near interpolators on LeNet. The vertical axis represents the test loss, while the horizontal axis corresponds to the number of training data. The error bars indicate the standard deviation over $2000$ trials for each training sample size. The red vertical line is the estimated upper bound of the strong sample complexity $d_\Theta-d_{\bar{\Theta}}+1$.
  • Figure 4: Test losses of random near interpolators on 2-layer FCDNN. The vertical axis represents the test loss, while the horizontal axis corresponds to the number of training data. The error bars indicate the standard deviation over $1000$ trials for each training sample size. The red vertical line is the estimated upper bound of the strong sample complexity $d_\Theta-d_{\bar{\Theta}}+1$.

Theorems & Definitions (37)

  • Theorem 1: Informal statement of Theorem \ref{thm: main}
  • Definition 1: Real analytic function
  • Definition 2: Real analytic manifold
  • Definition 3: Dimension of TES
  • Definition 4: Strong sample complexity for a random interpolator
  • Theorem 2: Strong sample complexity: general case
  • Corollary 3
  • Proposition 4: The generalization error of the near interpolator
  • Definition 5: Real analytic set
  • Definition 6: Dimension of real analytic set
  • ...and 27 more