Early stopping and polynomial smoothing in regression with reproducing kernels

Yaroslav Averyanov, Alain Celisse

TL;DR

This paper presents a data-driven early-stopping rule that requires no validation set. The rule is based on the so-called minimum discrepancy principle and is proved to be minimax-optimal over different types of kernel spaces, including finite-rank and Sobolev smoothness classes.

Abstract

In this paper, we study the problem of early stopping for iterative learning algorithms in a reproducing kernel Hilbert space (RKHS) in the nonparametric regression framework. In particular, we work with the gradient descent and (iterative) kernel ridge regression algorithms. We present a data-driven rule to perform early stopping without a validation set that is based on the so-called minimum discrepancy principle. The method requires only one assumption on the regression function: that it belongs to the RKHS. The proposed rule is proved to be minimax-optimal over different types of kernel spaces, including finite-rank and Sobolev smoothness classes. The proof is derived from the fixed-point analysis of the localized Rademacher complexities, which is a standard technique for obtaining optimal rates in the nonparametric regression literature. In addition, we present simulation results on artificial datasets showing that the designed rule performs comparably to other stopping rules, such as the one determined by V-fold cross-validation.
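To make the idea concrete, the following is a minimal sketch of kernel gradient descent stopped by a plain discrepancy rule: iterate until the empirical risk first drops to the noise level. It is an illustration under stated assumptions, not the authors' exact rule: the noise variance `sigma2` is taken as known, the threshold is simply `sigma2`, and no smoothing of the residuals is applied. The function names, the seed, and the values `n = 200` and `sigma = 0.15` are hypothetical choices; the kernel, step-size, design points, and "sinus" regression function are borrowed from the figure captions.

```python
import numpy as np


def sobolev_kernel(x):
    """Gram matrix of the first-order Sobolev kernel K(x1, x2) = min(x1, x2)."""
    return np.minimum.outer(x, x)


def kernel_gd_discrepancy_stop(K, y, sigma2, max_iter=100_000):
    """Kernel gradient descent on the fitted values, stopped by a plain
    discrepancy rule (assumption: the noise variance sigma2 is known).

    Returns the stopping time, the fitted values at that time, and the
    empirical risks (1/n) * ||y - f^t||^2 for t = 0, ..., stopping time.
    """
    n = len(y)
    K_n = K / n                              # normalized Gram matrix
    mu_1 = np.linalg.eigvalsh(K_n)[-1]       # largest eigenvalue (ascending order)
    eta = 1.0 / (1.2 * mu_1)                 # step-size used in the figures
    f = np.zeros(n)                          # start from the zero function
    residuals = [np.mean((y - f) ** 2)]
    t = 0
    while residuals[-1] > sigma2 and t < max_iter:
        f = f - eta * K_n @ (f - y)          # gradient step on the fitted values
        t += 1
        residuals.append(np.mean((y - f) ** 2))
    return t, f, np.array(residuals)


# Hypothetical experiment: equidistant design x_j = j/n, "sinus" regression function.
rng = np.random.default_rng(0)
n, sigma = 200, 0.15
x = np.arange(1, n + 1) / n
f_star = 0.4 * np.sin(4 * np.pi * x)
y = f_star + sigma * rng.normal(size=n)

max_iter = 100_000
tau, f_tau, residuals = kernel_gd_discrepancy_stop(
    sobolev_kernel(x), y, sigma**2, max_iter=max_iter
)
print("stopping time:", tau)
print("squared empirical-norm error:", np.mean((f_tau - f_star) ** 2))
```

Because the step-size keeps every eigenvalue of the iteration matrix strictly between 0 and 1, the empirical risk decreases monotonically, so the first crossing of the threshold is well defined. No held-out data are used at any point, which is the practical appeal of this family of rules.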

Paper Structure

This paper contains 42 sections, 23 theorems, 171 equations, 4 figures.

Key Result

Theorem 3.1

Under Assumptions a1 and a2, given the stopping rule (tau), for any positive $\theta$.

Figures (4)

  • Figure 1: Bias, variance, risk, and empirical risk behavior.
  • Figure 2: Histogram of $\tau$ vs $t^*$ vs $t^b \coloneqq \inf \{ t > 0 \ | \ B^2(t) \leq V(t) \}$ vs $t_{\textnormal{or}} \coloneqq \underset{t > 0}{\textnormal{argmin}} \left[ \mathbb{E}_{\varepsilon}\lVert f^t - f^* \rVert_n^2 \right]$ for kernel gradient descent with the step-size $\eta = 1 / (1.2 \widehat{\mu}_1)$ for the piece-wise linear $f^*(x) = |x - 1/2| - 1/2$ (panel (a)) and heavisine $f^*(x) = 0.093 \ [4 \ \textnormal{sin}(4 \pi x) - \textnormal{sign}(x - 0.3) - \textnormal{sign}(0.72 - x)]$ (panel (b)) regression functions, and the first-order Sobolev kernel $\mathbb{K}(x_1, x_2) = \min \{x_1, x_2 \}$.
  • Figure 3: Kernel gradient descent with the step-size $\eta = 1 / (1.2 \widehat{\mu}_1)$ and polynomial kernel $\mathbb{K}(x_1, x_2) = (1 + x_1^{\top}x_2)^3, \ x_1, x_2 \in [0, 1]$, for the estimation of two noisy regression functions: the smooth $f^*(x) = |x - 1/2| - 1/2$ for panel (a), and the "sinus" $f^*(x) = 0.4 \ \textnormal{sin}(4 \pi x)$ for panel (b), with the equidistant covariates $x_j = j/n$. Each curve corresponds to the $L_2(\mathbb{P}_n)$ squared norm error for the stopping rules (\ref{t_or}), (\ref{t_star}), (\ref{t_w}), (\ref{t_vf}), (\ref{tau}) averaged over $100$ independent trials, versus the sample size $n \in \{40, 80, 120, 200, 320, 400 \}$.
  • Figure 4: Kernel gradient descent (\ref{iterations}) with the step-size $\eta = 1 / (1.2 \widehat{\mu}_1)$ and Sobolev kernel $\mathbb{K}(x_1, x_2) = \min \{ x_1, x_2\}, \ x_1, x_2 \in [0, 1]$ for the estimation of two noisy regression functions: the smooth $f^*(x) = |x - 1/2| - 1/2$ for panel (a) and the "sinus" $f^*(x) = 0.4 \ \textnormal{sin}(4\pi x)$ for panel (b), with the equidistant covariates $x_j = j/n$. Each curve corresponds to the $L_2(\mathbb{P}_n)$ squared norm error for the stopping times (\ref{t_or}), (\ref{t_star}), (\ref{t_w}), (\ref{t_ho}), (\ref{t_alpha}) with $\alpha = 0.33$, averaged over $100$ independent trials, versus the sample size $n \in \{40, 80, 120, 200, 320, 400 \}$.
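
The reference times in the Figure 2 caption, $t^b$ (first time the squared bias falls below the variance) and $t_{\textnormal{or}}$ (minimizer of the expected squared error), can be computed exactly when $f^*$ and the noise level are known. The sketch below assumes a fixed design, zero initialization, and i.i.d. noise with variance `sigma**2`. Under those assumptions kernel gradient descent is a linear estimator, and its expected error splits into the squared bias `bias2` and the variance `variance` used here. These closed forms are the standard ones for this setting, and whether they coincide with the paper's $B^2(t)$ and $V(t)$ is an assumption. The values of `n` and `sigma` and the grid length are hypothetical.

```python
import numpy as np

# Hypothetical setup: Sobolev kernel min(x1, x2), design x_j = j/n, "sinus" f*.
n, sigma = 200, 0.15
x = np.arange(1, n + 1) / n
f_star = 0.4 * np.sin(4 * np.pi * x)

K_n = np.minimum.outer(x, x) / n           # normalized Gram matrix
mu, U = np.linalg.eigh(K_n)                # eigenvalues in ascending order
eta = 1.0 / (1.2 * mu[-1])                 # step-size from the caption
shrink = 1.0 - eta * mu                    # eigenvalues of S = I - eta * K_n
coef = U.T @ f_star                        # f* expressed in the eigenbasis

# With f^0 = 0 the iterate is f^t = (I - S^t) y, so on a grid of times:
t_grid = np.arange(0, 5001)
powers = shrink[None, :] ** t_grid[:, None]                 # shape (T + 1, n)
bias2 = (powers**2 @ coef**2) / n                           # ||S^t f*||_n^2
variance = sigma**2 / n * ((1.0 - powers) ** 2).sum(axis=1)
risk = bias2 + variance                                     # expected squared error

t_b = int(np.argmax(bias2 <= variance))    # first t with bias2 <= variance
t_or = int(np.argmin(risk))                # oracle stopping time on the grid
print("t_b =", t_b, " t_or =", t_or)
```

Both quantities depend on the unknown $f^*$, so they serve only as benchmarks against which a data-driven stopping time such as $\tau$ can be compared.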

Theorems & Definitions (44)

  • Definition 2.1
  • Theorem 3.1
  • proof : Proof of Theorem \ref{th:1}
  • Corollary 3.2
  • proof : Proof of Corollary \ref{corollary_empirical_norm}
  • Theorem 3.3
  • Remark
  • Corollary 3.4
  • Theorem 4.1: Lower bound from Theorem 1 in yang2017randomized
  • Example 1: $\beta$-polynomial eigenvalue decay kernels
  • ...and 34 more