Fast Computation of Leave-One-Out Cross-Validation for $k$-NN Regression

Motonobu Kanagawa

TL;DR

This work addresses the high computational cost of leave-one-out cross-validation (LOOCV) for $k$-NN regression by deriving an exact fast LOOCV formula. Under a tie-breaking assumption, LOOCV for a given $k$ equals the mean squared error of $(k+1)$-NN regression on the full training set, scaled by $\left(\frac{k+1}{k}\right)^2$, so a single $(k+1)$-NN fit on the full data yields the LOOCV score for that $k$ without refitting once per held-out point. The key contribution is the corollary ${\rm LOOCV}(k,D_n) = \left(\frac{k+1}{k}\right)^2 \frac{1}{n} \sum_{\ell=1}^n (\hat f_{k+1,D_n}(x_\ell) - y_\ell)^2$, along with empirical validation on real datasets and a discussion of the tie-breaking condition. This fast LOOCV computation facilitates rapid hyperparameter tuning and opens avenues for optimizing distance metrics in addition to $k$.
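
To see where the factor $\frac{k+1}{k}$ in the corollary comes from, the following is a short derivation sketch. The notation is introduced here for illustration and may differ from the paper's: $D_n^{(-\ell)}$ denotes the training set with $(x_\ell, y_\ell)$ removed, and $N_k(\ell)$ denotes the index set of the $k$ nearest neighbours of $x_\ell$ among the remaining inputs. The sketch assumes that, under the tie-breaking condition, the $k+1$ nearest neighbours of $x_\ell$ in the full training inputs are $x_\ell$ itself together with the points indexed by $N_k(\ell)$.

$$
\begin{aligned}
\hat f_{k+1,D_n}(x_\ell) &= \frac{1}{k+1}\Big( y_\ell + \sum_{i \in N_k(\ell)} y_i \Big) = \frac{1}{k+1}\Big( y_\ell + k\, \hat f_{k,D_n^{(-\ell)}}(x_\ell) \Big), \\
\hat f_{k+1,D_n}(x_\ell) - y_\ell &= \frac{k}{k+1}\Big( \hat f_{k,D_n^{(-\ell)}}(x_\ell) - y_\ell \Big), \\
\hat f_{k,D_n^{(-\ell)}}(x_\ell) - y_\ell &= \frac{k+1}{k}\Big( \hat f_{k+1,D_n}(x_\ell) - y_\ell \Big).
\end{aligned}
$$

Squaring the last line and averaging over $\ell = 1, \dots, n$ gives the corollary stated above.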

Abstract

We describe a fast computation method for leave-one-out cross-validation (LOOCV) for $k$-nearest neighbours ($k$-NN) regression. We show that, under a tie-breaking condition for nearest neighbours, the LOOCV estimate of the mean square error for $k$-NN regression is identical to the mean square error of $(k+1)$-NN regression evaluated on the training data, multiplied by the scaling factor $(k+1)^2/k^2$. Therefore, to compute the LOOCV score, one needs to fit $(k+1)$-NN regression only once, rather than repeating the training-validation of $k$-NN regression once per training point. Numerical experiments confirm the validity of the fast computation method.
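
The identity described in the abstract can be checked numerically. The following is a minimal sketch, not the authors' implementation: it uses a plain NumPy $k$-NN regressor with Euclidean distance, synthetic continuous inputs (so that distance ties are not expected), and hypothetical function names `knn_predict`, `loocv_brute` and `loocv_fast`.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Plain k-NN regression: average the targets of the k nearest training points."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    # Stable sort: if distances tie, the smaller training index comes first
    # (an assumption of this sketch, not something the source specifies).
    idx = np.argsort(d2, axis=1, kind="stable")[:, :k]
    return y_train[idx].mean(axis=1)

def loocv_brute(X, y, k):
    """Brute-force LOOCV: refit k-NN n times, each time holding out one point."""
    n = len(y)
    errs = np.empty(n)
    for l in range(n):
        mask = np.arange(n) != l
        pred = knn_predict(X[mask], y[mask], X[l:l + 1], k)[0]
        errs[l] = (pred - y[l]) ** 2
    return errs.mean()

def loocv_fast(X, y, k):
    """Fast formula: training MSE of (k+1)-NN on the full data, scaled by ((k+1)/k)^2."""
    pred = knn_predict(X, y, X, k + 1)
    return ((k + 1) / k) ** 2 * np.mean((pred - y) ** 2)

# Synthetic data (an assumption for illustration; the paper uses real datasets).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

for k in (1, 3, 5, 10):
    print(k, loocv_brute(X, y, k), loocv_fast(X, y, k))
```

The brute-force version runs $n$ separate neighbour searches, while the fast version runs one search over the full training set. With inputs drawn from a continuous distribution, the two printed scores should agree up to floating-point error for each $k$.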

Paper Structure (8 sections, 2 theorems, 14 equations, 4 figures)

Key Result

Lemma 1

Under Assumption \ref{as:train-input-points}, we have, for all $k \in \mathbb{N}$ and $\ell = 1, \dots, n$,

Figures (4)

  • Figure 1: Illustration of $k=3$ nearest neighbours of $x_\ell$. The blue point represents $x_\ell$, the three red points are the $k=3$ nearest neighbours of $x_\ell$ in $X_n \backslash \{ x_\ell \}$, the black points are other points in $X_n \backslash \{ x_\ell \}$, and the red circle is the sphere of radius equal to the distance between $x_\ell$ and its third nearest neighbour in $X_n \backslash \{ x_\ell \}$ (the red point on the circle).
  • Figure 2: Experimental results on the Diabetes dataset. The left figure shows the LOOCV scores \ref{eq:LOOCV-naive} computed in the brute-force manner ("LOOCV-Brute") and by using the derived formula \ref{eq:LOOCV-fast-formula} for different values of $k$. The right figure shows the computation times of both approaches for different data sizes $n$ with $k=5$ fixed.
  • Figure 3: Experimental results on the Wine dataset. The left figure shows the LOOCV scores \ref{eq:LOOCV-naive} computed in the brute-force manner ("LOOCV-Brute") and by using the derived formula \ref{eq:LOOCV-fast-formula} for different values of $k$. The right figure shows the computation times of both approaches for different data sizes $n$ with $k=5$ fixed.
  • Figure 4: LOOCV scores for the Diabetes dataset (left) and the Wine dataset (right), each using only one input feature. The selected input feature has many duplicate values and thus does not satisfy the tie-breaking condition in Assumption \ref{as:train-input-points}. Left: The best $k$ with the lowest LOOCV score is 17 for both LOOCV-Brute and LOOCV-Efficient. Right: The best $k$ with the lowest LOOCV score is 21 for LOOCV-Brute and 17 for LOOCV-Efficient.
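
The effect of duplicated inputs noted for Figure 4 can be reproduced on a toy example. The sketch below is an illustration under stated assumptions, not the paper's experiment: the four data points are made up, the function names are hypothetical, and ties are resolved by training index through a stable sort, which is only one of several possible tie-handling rules.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k):
    """1-D k-NN regression; ties in distance are resolved by training index."""
    d = np.abs(x_query[:, None] - x_train[None, :])
    idx = np.argsort(d, axis=1, kind="stable")[:, :k]
    return y_train[idx].mean(axis=1)

def loocv_brute(x, y, k):
    """Brute-force LOOCV: hold out each point in turn and predict it."""
    n = len(y)
    errs = np.empty(n)
    for l in range(n):
        mask = np.arange(n) != l
        errs[l] = (knn_predict(x[mask], y[mask], x[l:l + 1], k)[0] - y[l]) ** 2
    return errs.mean()

def loocv_fast(x, y, k):
    """Fast formula: scaled training MSE of (k+1)-NN on the full data."""
    pred = knn_predict(x, y, x, k + 1)
    return ((k + 1) / k) ** 2 * np.mean((pred - y) ** 2)

# Three training inputs share the same value, so the tie-breaking condition fails.
x = np.array([0.0, 0.0, 0.0, 1.0])
y = np.array([0.0, 1.0, 5.0, 2.0])
k = 1

brute = loocv_brute(x, y, k)
fast = loocv_fast(x, y, k)
print(brute, fast)
```

Here the third point is not among its own two nearest neighbours in the full-data fit, because two other points at distance zero come earlier in the index order. The $(k+1)$-NN prediction therefore no longer contains $y_\ell$, and the two scores differ.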

Theorems & Definitions (4)

  • Lemma 1
  • proof
  • Corollary 1
  • proof