The Relative Instability of Model Comparison with Cross-validation

Alexandre Bayle, Lucas Janson, Lester Mackey

TL;DR

This paper investigates a surprising failure mode of cross-validation (CV): even when each learning rule is individually stable, the comparison between two models with similar tuning parameters can be relatively unstable, invalidating CV-based inference for model improvement. By analyzing soft-thresholding (ST) and the Lasso in a fixed-dimensional linear model with Gaussian features and noise, the authors derive precise rates for the variance and higher-moment quantities of the comparison and show that these rates violate the conditions of the CV central limit theorem (CLT). They connect these findings to CV variance estimation, demonstrate undercoverage of CV confidence intervals for differences between models, and propose a conservative confidence interval construction that remains valid when each algorithm is stable on its own. The results caution practitioners to verify relative stability before using CV for model selection or hypothesis testing, and they point to future work on broad, verifiable stability conditions and improved inference techniques for algorithm comparisons.

Abstract

Cross-validation (CV) is known to provide asymptotically exact tests and confidence intervals for model improvement but only when the model comparison is relatively stable. Surprisingly, we prove that even simple, individually stable models can generate relatively unstable comparisons, calling into question the validity of CV inference. Specifically, we show that the Lasso and its close cousin, soft-thresholding, generate relatively unstable comparisons and invalid CV inferences, even in the most favorable of learning settings and when both models are individually stable. These findings highlight the importance of verifying relative stability before deploying CV for model comparison.
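
To make the object of study concrete, the following is a minimal sketch of a CV-based confidence interval for the difference in test error between two models. It assumes the usual normal-approximation interval $\hat{R}_n \pm z_{1-\alpha/2}\,\hat{\sigma}_n/\sqrt{n}$, with $\hat{\sigma}_n$ taken as the sample standard deviation of the per-point held-out loss differences; the paper's exact variance estimator and scaling may differ. The function names `ridge_fit` and `cv_diff_interval`, the ridge learner, and the simulated data are illustrative assumptions, not the authors' code.

```python
import numpy as np
from statistics import NormalDist


def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (illustrative learner, assumed here)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def cv_diff_interval(X, y, lam_a, lam_b, k=10, alpha=0.05, seed=0):
    """Normal-approximation CI for the k-fold CV difference in squared error.

    Returns (r_hat, (lower, upper)), where r_hat is the average held-out
    loss of model A minus that of model B.
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    diffs = np.empty(n)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Both models are trained on the same training folds.
        beta_a = ridge_fit(X[train], y[train], lam_a)
        beta_b = ridge_fit(X[train], y[train], lam_b)
        loss_a = (y[test_idx] - X[test_idx] @ beta_a) ** 2
        loss_b = (y[test_idx] - X[test_idx] @ beta_b) ** 2
        diffs[test_idx] = loss_a - loss_b
    r_hat = diffs.mean()
    sigma_hat = diffs.std(ddof=1)  # assumed variance estimator
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma_hat / np.sqrt(n)
    return r_hat, (r_hat - half_width, r_hat + half_width)


# Simulated Gaussian linear model (hypothetical data for illustration).
rng = np.random.default_rng(1)
n, p = 300, 10
beta_star = np.array([3, 1, -5, 3, 0, 0, 0, 0, 0, 0], dtype=float)
X = rng.standard_normal((n, p))
y = X @ beta_star + rng.standard_normal(n)
r_hat, (lo, hi) = cv_diff_interval(X, y, lam_a=1.0, lam_b=1.5)
print(f"estimated difference {r_hat:.4f}, 95% interval ({lo:.4f}, {hi:.4f})")
```

According to the abstract, an interval of this kind is only asymptotically exact when the comparison is relatively stable, so the output should not be read as a valid interval without checking that condition.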

Paper Structure

This paper contains 26 sections, 17 theorems, 154 equations, 5 figures.

Key Result

Lemma 2.3

For any $\lambda_n > 0$, $\mathbf{Y}\in\mathbb{R}^n$, and $\mathbf{X}\in\mathbb{R}^{n\times p}$, the ST($\lambda_n$) estimator $\hat{\beta}_{\lambda_n}$ and Lasso($\lambda_n$) estimator $\hat{\beta}^{\mathrm{lasso}}_{\lambda_n}\in \operatorname*{arg\,min}_{\beta\in\mathbb{R}^p} \frac{1}{2n} \|\dots$ … where $\mu_n \triangleq \lambda_{\mathrm{min}}(\mathbf{X}^\top \dots)$ …

Figures (5)

  • Figure 1: The cross-validation central limit theorem [BBJM:2020] yields accurate coverage for the relatively stable Lasso algorithm but severely undercovers for the relatively unstable comparison of two Lasso fits. See the paper's experiment-details section (sec:experiment-details-coverage-prob-fig) for full experiment details.
  • Figure 2: ST with $\lambda_n = \sqrt{n}$ when $\beta^\star = (3, 1, -5, 3, 0, 0, 0, 0, 0, 0)$. Top: $\sigma^2(h_n)$, $\gamma(h_n)$, and $r(h_n)$, all normalized by their values at $n = 900$. Bottom (best viewed in color): KDE plots for $\frac{\sqrt{\frac{n k}{k-1}}}{\hat{\sigma}_n(h_n)} (\hat{R}_n - R_n)$ (solid curves) and $\frac{\sqrt{\frac{n k}{k-1}}}{\sigma(h_n)} (\hat{R}_n - R_n)$ (dashed curves).
  • Figure 3: Lasso with cross-validated $\lambda_n$ when $\beta^\star = (3, 1, -5, 3, 0, 0, 0, 0, 0, 0)$. Top: $\sigma^2(h_n)$, $\gamma(h_n)$, and $r(h_n)$, all normalized by their values at $n = 900$. Bottom (best viewed in color): KDE plots for $\frac{\sqrt{\frac{n k}{k-1}}}{\hat{\sigma}_n(h_n)} (\hat{R}_n - R_n)$ (solid curves) and $\frac{\sqrt{\frac{n k}{k-1}}}{\sigma(h_n)} (\hat{R}_n - R_n)$ (dashed curves).
  • Figure 4: Ridge regression with $\lambda_n = \sqrt{n}$ when $\beta^\star = (3, 1, -5, 3, 0, 0, 0, 0, 0, 0)$. Top: $\sigma^2(h_n)$, $\gamma(h_n)$, and $r(h_n)$, all normalized by their values at $n = 900$. Bottom (best viewed in color): KDE plots for $\frac{\sqrt{\frac{n k}{k-1}}}{\hat{\sigma}_n(h_n)} (\hat{R}_n - R_n)$ (solid curves) and $\frac{\sqrt{\frac{n k}{k-1}}}{\sigma(h_n)} (\hat{R}_n - R_n)$ (dashed curves).
  • Figure 5: ST with $\lambda_n = \sqrt{n}$ when $\beta^\star = (3, 1, -5, 3, 4, -3, 10, 8, 5, 2)$. Top: $\sigma^2(h_n)$, $\gamma(h_n)$, and $r(h_n)$, all normalized by their values at $n = 900$ for the single algorithm and at $n = 9000$ for the comparison. Bottom (best viewed in color): KDE plots for $\frac{\sqrt{\frac{n k}{k-1}}}{\hat{\sigma}_n(h_n)} (\hat{R}_n - R_n)$ (solid curves) and $\frac{\sqrt{\frac{n k}{k-1}}}{\sigma(h_n)} (\hat{R}_n - R_n)$ (dashed curves).

Theorems & Definitions (19)

  • Definition 2.1: Relative loss stability
  • Definition 2.2: Soft-thresholding (ST)
  • Lemma 2.3: Lasso-ST proximity
  • Theorem 3.1: Relative instability of ST comparisons
  • Theorem 3.2: Relative stability of ST
  • Theorem 3.3: Relative instability of Lasso comparisons
  • Theorem 3.4: Relative stability of the Lasso
  • Proposition 6.1: Comparison coverage from single algorithm coverage
  • Proposition 4.1: Convergence rate of $\sigma^2(h_n^{\mathrm{diff}})$ for comparison of ST($\lambda_n$) with ST($\lambda_n + \delta_n$)
  • Proposition 4.2: Lower-bounding rate of $\gamma(h_n^{\mathrm{diff}})$ for comparison of ST($\lambda_n$) with ST($\lambda_n + \delta_n$)
  • ...and 9 more
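
The list above names soft-thresholding (Definition 2.2) and its proximity to the Lasso (Lemma 2.3), but neither statement is reproduced in full here. As a rough illustration only, the sketch below uses the standard coordinatewise soft-thresholding operator and the textbook fact that, when $\mathbf{X}^\top\mathbf{X}/n = I$, it minimizes the Lasso objective $\frac{1}{2n}\|\mathbf{Y}-\mathbf{X}\beta\|_2^2 + \lambda\|\beta\|_1$. The scaling of the penalty, the function names, and the simulated data are assumptions and may not match the paper's definitions.

```python
import numpy as np


def soft_threshold(z, t):
    """Standard soft-thresholding: shrink each entry toward 0 by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_objective(beta, X, y, lam):
    """Lasso objective under an assumed scaling: (1/2n)||y - X b||^2 + lam ||b||_1."""
    n = len(y)
    return np.sum((y - X @ beta) ** 2) / (2 * n) + lam * np.sum(np.abs(beta))


# Hypothetical orthonormal design: columns scaled so that X^T X / n = I.
rng = np.random.default_rng(0)
n, p, lam = 200, 5, 0.3
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q
beta_star = np.array([3.0, 1.0, -5.0, 0.0, 0.0])
y = X @ beta_star + rng.standard_normal(n)

# Under this design the Lasso minimizer is soft-thresholding of X^T y / n.
beta_st = soft_threshold(X.T @ y / n, lam)
print("soft-thresholded estimate:", np.round(beta_st, 3))
print("objective value:", lasso_objective(beta_st, X, y, lam))
```

For a general design the two estimators no longer coincide, which is presumably why the paper states a proximity result rather than an equality.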