Changing the Kernel During Training Leads to Double Descent in Kernel Regression
Oskar Allerbo
TL;DR
This work extends kernel regression by allowing the kernel bandwidth to decrease during training, linking bandwidth evolution to increasing model complexity and the occurrence of double descent. It provides theoretical generalization bounds for non-constant kernels and introduces a practical bandwidth-decreasing scheme (KGD-D) guided by the training fit (R^2). In synthetic and real-data experiments, the method often surpasses constant-bandwidth approaches and exhibits benign overfitting when the bandwidth tends to zero. The authors also discuss implications for neural networks via the neural tangent kernel (NTK), suggesting that controlled kernel evolution during training can improve generalization and training efficiency.
Abstract
We investigate changing the bandwidth of a translation-invariant kernel during training when solving kernel regression with gradient descent. We present a theoretical bound on the out-of-sample generalization error that advocates for decreasing the bandwidth (and thus increasing the model complexity) during training. We further use the bound to show that kernel regression exhibits double descent behavior when the model complexity is expressed as the minimum allowed bandwidth during training. Decreasing the bandwidth all the way to zero results in benign overfitting and also circumvents the need for model selection. We demonstrate the double descent behavior on real and synthetic data, and show that kernel regression with a decreasing bandwidth outperforms kernel regression with a constant bandwidth selected by cross-validation or marginal likelihood maximization. Finally, we apply our findings to neural networks, demonstrating that by modifying the neural tangent kernel (NTK) during training so that it behaves as if its bandwidth were decreasing to zero, we can make the network overfit more benignly and converge in fewer iterations.
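
To make the idea of decreasing the bandwidth during training concrete, the following is a minimal sketch of kernel gradient descent with a bandwidth that shrinks whenever the training R^2 stops improving quickly. It is an illustration under stated assumptions, not the paper's exact KGD-D algorithm: the Gaussian kernel, the multiplicative decay rule, the stopping criterion, and all names and default values (kgd_decreasing_bandwidth, sigma0, sigma_min, decay, min_r2_gain, r2_target) are hypothetical choices made for this example.

```python
import numpy as np


def gaussian_kernel(A, B, bandwidth):
    """Gaussian (translation-invariant) kernel matrix between rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def r2_score(y, y_hat):
    """Coefficient of determination of the predictions y_hat for targets y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def kgd_decreasing_bandwidth(X, y, X_new, sigma0=1.0, sigma_min=0.01,
                             decay=0.8, min_r2_gain=1e-3, r2_target=0.99,
                             max_iter=5000):
    """Kernel gradient descent with a bandwidth that decreases during training.

    Assumed schedule (illustrative only): multiply the bandwidth by `decay`
    whenever one step improves the training R^2 by less than `min_r2_gain`.
    """
    n = len(y)
    # The kernel matrix is positive semi-definite with unit diagonal, so its
    # largest eigenvalue is at most n; a step size of 1/n keeps updates stable.
    lr = 1.0 / n
    f_train = np.zeros(n)          # in-sample predictions, initialized at zero
    f_new = np.zeros(len(X_new))   # out-of-sample predictions
    sigma = sigma0
    r2_history = [r2_score(y, f_train)]

    for _ in range(max_iter):
        residual = y - f_train
        # Functional gradient step using the kernel at the current bandwidth.
        f_new = f_new + lr * gaussian_kernel(X_new, X, sigma) @ residual
        f_train = f_train + lr * gaussian_kernel(X, X, sigma) @ residual

        r2_history.append(r2_score(y, f_train))
        if r2_history[-1] >= r2_target:
            break
        # Training fit has stalled: increase model complexity by shrinking
        # the bandwidth, but never below sigma_min.
        if r2_history[-1] - r2_history[-2] < min_r2_gain:
            sigma = max(decay * sigma, sigma_min)

    return f_train, f_new, sigma, r2_history


# Usage on a small synthetic problem (data chosen only for illustration).
rng = np.random.default_rng(0)
X_demo = np.linspace(-1.0, 1.0, 30)[:, None]
y_demo = np.sin(3.0 * X_demo[:, 0]) + 0.1 * rng.standard_normal(30)
X_grid = np.linspace(-1.0, 1.0, 7)[:, None]
_, preds, final_sigma, history = kgd_decreasing_bandwidth(X_demo, y_demo, X_grid)
print(f"final bandwidth: {final_sigma:.4f}, training R^2: {history[-1]:.4f}")
```

In this sketch, sigma_min plays the role of the minimum allowed bandwidth, which the abstract uses as the measure of model complexity. Letting sigma_min approach zero corresponds to the regime where the abstract reports benign overfitting.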
