The Relative Instability of Model Comparison with Cross-validation
Alexandre Bayle, Lucas Janson, Lester Mackey
TL;DR
This paper investigates a surprising failure mode of cross-validation (CV): even when each learning algorithm is individually stable, the comparison between two models with similar tuning parameters can be relatively unstable, invalidating CV-based inference for model improvement. By analyzing soft-thresholding and the Lasso in a fixed-dimensional linear model with Gaussian features and noise, the authors derive precise rates showing that relative instability arises in such comparisons, with the variance and higher-moment quantities failing to satisfy the assumptions of the CV central limit theorem (CLT). They connect these findings to CV variance estimation, demonstrate undercoverage of CV confidence intervals for the difference between two models, and propose a conservative confidence interval construction that remains valid when each algorithm is stable on its own. The results caution practitioners to verify relative stability before using CV for model selection or hypothesis testing, and they point to future work on broad, verifiable stability conditions and improved inference techniques for algorithm comparisons.
Abstract
Cross-validation (CV) is known to provide asymptotically exact tests and confidence intervals for model improvement but only when the model comparison is relatively stable. Surprisingly, we prove that even simple, individually stable models can generate relatively unstable comparisons, calling into question the validity of CV inference. Specifically, we show that the Lasso and its close cousin, soft-thresholding, generate relatively unstable comparisons and invalid CV inferences, even in the most favorable of learning settings and when both models are individually stable. These findings highlight the importance of verifying relative stability before deploying CV for model comparison.
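To make the object of study concrete, the following is a minimal sketch of a naive k-fold CV confidence interval for the difference in squared-error loss between two soft-thresholding estimators. It is an illustration under stated assumptions, not the authors' code. The estimator form (soft-thresholding X^T y / n), the function names `fit_st` and `cv_diff_interval`, the pooled sample-variance estimate, and all numerical settings (n, d, the coefficients, the two thresholds) are hypothetical choices. This is the kind of interval whose validity depends on relative stability, not the conservative construction the paper proposes.

```python
import numpy as np

def soft_threshold(z, lam):
    """Elementwise soft-thresholding: sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def fit_st(X, y, lam):
    # Assumed estimator form: soft-threshold the marginal statistics X^T y / n.
    return soft_threshold(X.T @ y / len(y), lam)

def cv_diff_interval(X, y, lam_a, lam_b, k=10, z=1.96):
    """Naive k-fold CV interval for mean loss of model A minus model B."""
    n = len(y)
    folds = np.array_split(np.arange(n), k)
    diffs = np.empty(n)
    for idx in folds:
        train = np.ones(n, dtype=bool)
        train[idx] = False  # hold out the current fold
        beta_a = fit_st(X[train], y[train], lam_a)
        beta_b = fit_st(X[train], y[train], lam_b)
        loss_a = (y[idx] - X[idx] @ beta_a) ** 2
        loss_b = (y[idx] - X[idx] @ beta_b) ** 2
        diffs[idx] = loss_a - loss_b  # per-point held-out loss difference
    center = diffs.mean()
    # Assumption: variance estimated by the pooled sample variance of the differences.
    half = z * diffs.std(ddof=1) / np.sqrt(n)
    return center - half, center + half

# Illustrative data: Gaussian features and noise, fixed dimension.
rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.standard_normal((n, d))
beta = np.array([1.0, 0.5, 0.0, 0.0, 0.0])
y = X @ beta + rng.standard_normal(n)

lo, hi = cv_diff_interval(X, y, lam_a=0.10, lam_b=0.11)
print(f"naive 95% CV interval for the loss difference: [{lo:.4f}, {hi:.4f}]")
```

When the two thresholds are close, the per-point loss differences are small, and the interval width is driven by the estimated variance of those differences. The paper's results concern whether intervals of this kind attain their nominal coverage in such comparisons.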
