Changing the Kernel During Training Leads to Double Descent in Kernel Regression

Oskar Allerbo

TL;DR

This work extends kernel regression by allowing the kernel bandwidth to decrease during training, linking the evolution of the bandwidth to increasing model complexity and to the occurrence of double descent. It provides theoretical generalization bounds for non-constant kernels and introduces a practical bandwidth-decreasing scheme (KGD-D) guided by the training fit ($R^2$). In experiments on synthetic and real data, the method often outperforms constant-bandwidth approaches and exhibits benign overfitting when the bandwidth tends to zero. The author also discusses implications for neural networks via the neural tangent kernel (NTK), suggesting that controlled kernel evolution during training can improve generalization and training efficiency.

Abstract

We investigate changing the bandwidth of a translational-invariant kernel during training when solving kernel regression with gradient descent. We present a theoretical bound on the out-of-sample generalization error that advocates for decreasing the bandwidth (and thus increasing the model complexity) during training. We further use the bound to show that kernel regression exhibits a double descent behavior when the model complexity is expressed as the minimum allowed bandwidth during training. Decreasing the bandwidth all the way to zero results in benign overfitting, and also circumvents the need for model selection. We demonstrate the double descent behavior on real and synthetic data and also demonstrate that kernel regression with a decreasing bandwidth outperforms that of a constant bandwidth, selected by cross-validation or marginal likelihood maximization. We finally apply our findings to neural networks, demonstrating that by modifying the neural tangent kernel (NTK) during training, making the NTK behave as if its bandwidth were decreasing to zero, we can make the network overfit more benignly, and converge in fewer iterations.
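
The idea of solving kernel regression with gradient descent while shrinking the bandwidth can be illustrated with a minimal sketch. This is not the paper's KGD-D algorithm. The Gaussian kernel, the step-size normalisation, and the names `kgd_decreasing`, `shrink`, `min_gain` and `target_r2` are assumptions chosen for illustration. The only element taken from the text is the principle that the bandwidth decreases during training, guided by the training $R^2$.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian (translation-invariant) kernel matrix between rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def r_squared(y, y_hat):
    """Coefficient of determination on the training data."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def kgd_decreasing(X, y, X_new, sigma0=2.0, sigma_min=1e-2, shrink=0.9,
                   lr=0.5, min_gain=1e-3, target_r2=0.9999, max_iter=20000):
    """Kernel gradient descent with a bandwidth that shrinks when R^2 stalls.

    Hypothetical rule (an assumption, not the paper's schedule): if one
    gradient step improves the training R^2 by less than `min_gain`,
    multiply the bandwidth by `shrink`, but never go below `sigma_min`.
    """
    f_train = np.zeros(len(y))      # predictions at the training points
    f_new = np.zeros(len(X_new))    # predictions at the new points
    sigma = sigma0
    r2 = r_squared(y, f_train)
    for _ in range(max_iter):
        resid = y - f_train
        K = gaussian_kernel(X, X, sigma)
        K_new = gaussian_kernel(X_new, X, sigma)
        # Normalise the step by the largest eigenvalue of K so that the
        # training residual norm cannot increase.
        step = lr / np.linalg.eigvalsh(K)[-1]
        f_new = f_new + step * (K_new @ resid)
        f_train = f_train + step * (K @ resid)
        new_r2 = r_squared(y, f_train)
        if new_r2 - r2 < min_gain and sigma > sigma_min:
            sigma = max(sigma * shrink, sigma_min)  # increase model complexity
        r2 = new_r2
        if r2 >= target_r2:
            break
    return f_train, f_new, sigma, r2

# Usage on toy data (also an assumption, not data from the paper).
X = np.linspace(-1.0, 1.0, 30)[:, None]
y = np.sin(4.0 * X[:, 0])
X_grid = np.linspace(-1.0, 1.0, 200)[:, None]
f_train, f_grid, final_sigma, final_r2 = kgd_decreasing(X, y, X_grid)
```

In this sketch the early iterations use a wide kernel and fit only the smooth, low-frequency structure. Narrower kernels are introduced only once the current bandwidth no longer improves the training fit, which mirrors the increase in model complexity described in the abstract.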

Paper Structure

This paper contains 13 sections, 7 theorems, 59 equations, 7 figures, 6 tables, 1 algorithm.

Key Result

Lemma 1

where $s_{\min}=s_{\min}(\bm{K})$ denotes the smallest singular value of $\bm{K}$.
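
The quantity $s_{\min}(\bm{K})$ can be computed directly from a kernel matrix. The sketch below is an illustration under stated assumptions (a Gaussian kernel and evenly spaced one-dimensional inputs, neither of which is specified in the text above). It shows how the smallest singular value depends on the bandwidth: for distinct inputs, the Gaussian kernel matrix approaches the identity as the bandwidth goes to zero, so $s_{\min}$ approaches 1, while for wide bandwidths the matrix becomes nearly singular.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Gaussian kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def smallest_singular_value(K):
    """s_min(K): np.linalg.svd returns singular values in descending order."""
    return np.linalg.svd(K, compute_uv=False)[-1]

# Evenly spaced toy inputs (an assumption made for this illustration).
X = np.linspace(-1.0, 1.0, 20)[:, None]
s_min_by_sigma = {
    sigma: smallest_singular_value(gaussian_kernel_matrix(X, sigma))
    for sigma in (1.0, 0.1, 0.01)
}
for sigma, s_min in s_min_by_sigma.items():
    print(f"sigma={sigma}: s_min={s_min:.3e}")
```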

Figures (7)

  • Figure 1: Top row: Inferred functions using KGD-D, KRR-GCV, and KRR-MML. When the bandwidth is allowed to change during training, all parts of the functions are captured by the model. In contrast, for a constant bandwidth, the inferred functions perform well on the high-frequency parts of the data, which have more observations, and poorly on the linear/low-frequency parts. Bottom row: Training error for KGD-D as a function of the bandwidth, which shows which bandwidths are used to model the data. For most values of $\sigma$, the errors decrease very slowly, if at all, with distinct drops at some bandwidths, which depend on the frequencies of the sine functions.
  • Figure 2: Inferred KGD-D functions for five different training times, where lower panels correspond to longer training times. Initially, the inferred functions are almost linear, but the complexities increase during training. Simpler parts of the data are captured earlier during training. Eventually, the models perfectly interpolate the training data.
  • Figure 3: Training and test errors (as $1-R^2$) as functions of model complexity (in terms of $\sigma_m$), for kernel regression with constant and decreasing bandwidths. The plots show the means, together with the 90% prediction intervals. For simple models, both the training and test errors are large, but they all decrease with increasing model complexity. While the training errors decrease toward zero, the test errors start to increase again as the models become even more complex. When using a decreasing bandwidth, a second descent in the error results in good generalization for very complex models, something that is not the case for the constant bandwidth model.
  • Figure 4: Inferred models for the data generated according to Equation \ref{eq:syn_dd}, for different model complexities, corresponding to the four cases in Table \ref{tab:double_descent}. For $\sigma_m=1$, the models are too simple and perform badly on both training and test data. For $\sigma_m=0.5$, the models do well on both training and test data. For $\sigma_m=0.1$, the models perfectly explain the training data but tend to exhibit extreme predictions between observations. For $\sigma_m=0.01$, both models perfectly explain the training data, but in contrast to the decreasing bandwidth model, the constant bandwidth model tends to generalize poorly.
  • Figure 5: Inferred functions on the same data as in Figure \ref{fig:syn_dec_100}, for the four kernels in Table \ref{tab:kerns}. Regardless of the kernel, the functions tend to be similar, with one exception: For small values of $\nu$, which correspond to lower smoothness of the kernel (in terms of differentiability), constant bandwidth functions tend to linearly interpolate the data.
  • ...and 2 more figures

Theorems & Definitions (14)

  • Lemma 1
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Lemma 2
  • proof : Proof of Lemma \ref{thm:const_bound}
  • Lemma 3
  • proof : Proof
  • proof : Proof of Proposition \ref{thm:change_bound}
  • Lemma 4
  • ...and 4 more