Valid prediction intervals for regression problems
Nicolas Dewolf, Bernard De Baets, Willem Waegeman
TL;DR
This paper surveys four classes of methods for constructing prediction intervals in regression (Bayesian, ensemble, direct interval estimation, and conformal prediction) and benchmarks them empirically. It formalizes a valid prediction interval through the coverage condition $C(\Gamma, P) \ge 1 - \alpha$, paired with the goal of keeping intervals as narrow as possible, and distinguishes marginal from asymptotic validity. The key contribution is a structured comparison showing that conformal prediction offers a robust, model-agnostic calibration step that yields intervals with nominal coverage, often matching or improving on the alternatives across diverse data sets. The comparison also shows how Gaussian or other distributional assumptions can degrade interval validity on skewed data. For practitioners, the findings favor post-hoc conformal calibration as a way to obtain reliable coverage while controlling interval width, subject to trade-offs in scalability and a dependence on data properties. Overall, the paper clarifies how calibration interacts with the data-generating process and the model class to produce reliable uncertainty quantification for regression, which matters most in high-stakes applications.
Abstract
Over the last few decades, various methods have been proposed for estimating prediction intervals in regression settings, including Bayesian methods, ensemble methods, direct interval estimation methods and conformal prediction methods. An important issue is the calibration of these methods: the generated prediction intervals should have a predefined coverage level, without being overly conservative. In this work, we review the above four classes of methods from a conceptual and experimental point of view. Results on benchmark data sets from various domains highlight large fluctuations in performance from one data set to another. These observations can be attributed to the violation of certain assumptions that are inherent to some classes of methods. We illustrate how conformal prediction can be used as a general calibration procedure for methods that deliver poor results without a calibration step.
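To make the idea of conformal prediction as a post-hoc calibration step concrete, the following is a minimal sketch of split (inductive) conformal prediction with absolute residuals as the nonconformity score. It is one common variant and is not claimed to be the exact procedure benchmarked in this work. The function name `split_conformal_interval`, the synthetic data, and the linear point predictor are all illustrative assumptions; any fitted regression model could take the place of `predict`. The noise is deliberately skewed to show that marginal coverage does not rely on a Gaussian assumption, only on exchangeability of calibration and test points.

```python
import numpy as np


def split_conformal_interval(predict, X_cal, y_cal, X_new, alpha=0.1):
    """Symmetric intervals around point predictions (illustrative helper).

    Under exchangeability of calibration and test points, the intervals
    have marginal coverage of at least 1 - alpha.
    """
    # Nonconformity scores: absolute residuals on held-out calibration data.
    scores = np.abs(y_cal - predict(X_cal))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = int(np.ceil((n + 1) * (1 - alpha)))
    # If the calibration set is too small for the requested level,
    # the only valid interval is the whole real line.
    q = np.inf if k > n else np.sort(scores)[k - 1]
    preds = predict(X_new)
    return preds - q, preds + q


# Synthetic data with skewed (exponential) noise; purely an assumption
# made for this illustration.
rng = np.random.default_rng(0)


def sample(n):
    x = rng.uniform(-2.0, 2.0, size=n)
    y = 2.0 * x + rng.exponential(1.0, size=n)
    return x, y


# Fit a simple least-squares line on the training split.
x_train, y_train = sample(500)
coef = np.polyfit(x_train, y_train, deg=1)


def predict(x):
    return np.polyval(coef, x)


# Calibrate on a separate split, then evaluate on fresh test data.
x_cal, y_cal = sample(500)
x_test, y_test = sample(5000)
lower, upper = split_conformal_interval(predict, x_cal, y_cal, x_test, alpha=0.1)

coverage = np.mean((y_test >= lower) & (y_test <= upper))
avg_width = np.mean(upper - lower)
print(f"empirical coverage: {coverage:.3f}, average width: {avg_width:.3f}")
```

The empirical coverage should land near the nominal level $1 - \alpha = 0.9$, up to sampling variation in the calibration and test sets. Because the score is an absolute residual, every interval has the same width; scores normalized by a local uncertainty estimate are a common way to obtain adaptive widths.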
