Uncertainty Quantification for Regression: A Unified Framework based on kernel scores
Christopher Bülte, Yusuf Sale, Gitta Kutyniok, Eyke Hüllermeier
TL;DR
This work addresses uncertainty quantification in regression by defining total, aleatoric, and epistemic uncertainty within a single framework of proper scoring rules, with a focus on kernel scores. The authors formalize both Bayesian-model-average and pairwise estimation schemes and link kernel properties to downstream behavior, which yields concrete design guidelines for task-specific uncertainty measures. The framework encompasses existing measures (e.g., variance, entropy, energy distance, MMD) and, through choices such as the energy score or the Gaussian kernel score, supports measures that are robust, translation-invariant, or adapted to a given task. Experiments on weather data, UCI benchmarks, and active-learning tasks show that the measures are robust and that adapting the measure to the task pays off, with clear trade-offs among robustness, OOD responsiveness, and computational cost. Overall, the framework offers a principled way to tailor regression uncertainty to application requirements and opens the door to more flexible uncertainty representations in regression.
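To make the two estimation schemes more concrete, the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes each ensemble member is represented by samples from its predictive distribution and uses the divergence induced by the energy score, E|X - Y| - 0.5 E|X - X'| - 0.5 E|Y - Y'| (half the energy distance). The function names `divergence`, `epistemic_bma`, and `epistemic_pairwise` are hypothetical, and excluding self-pairs in the pairwise scheme is an assumption.

```python
import numpy as np


def energy_rho(a, b):
    """Absolute distance, the negative-definite kernel behind the energy score."""
    return np.abs(a - b)


def divergence(p, q, rho=energy_rho):
    """Sample estimate of E rho(X, Y) - 0.5 E rho(X, X') - 0.5 E rho(Y, Y')."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    cross = rho(p[:, None], q[None, :]).mean()
    within_p = rho(p[:, None], p[None, :]).mean()
    within_q = rho(q[:, None], q[None, :]).mean()
    return float(cross - 0.5 * within_p - 0.5 * within_q)


def epistemic_bma(ensemble, rho=energy_rho):
    """Average divergence between the pooled mixture (BMA) and each member."""
    ensemble = np.asarray(ensemble, dtype=float)
    pooled = ensemble.ravel()
    return float(np.mean([divergence(pooled, m, rho) for m in ensemble]))


def epistemic_pairwise(ensemble, rho=energy_rho):
    """Average divergence over all ordered pairs of distinct members (assumed)."""
    ensemble = np.asarray(ensemble, dtype=float)
    n_members = len(ensemble)
    values = [
        divergence(ensemble[i], ensemble[j], rho)
        for i in range(n_members)
        for j in range(n_members)
        if i != j
    ]
    return float(np.mean(values))


# Toy data: five members that agree, and five whose predictions are shifted apart.
rng = np.random.default_rng(0)
agree = rng.normal(0.0, 1.0, size=(5, 200))
disagree = agree + 2.0 * np.arange(5)[:, None]

bma_agree, bma_disagree = epistemic_bma(agree), epistemic_bma(disagree)
pair_agree, pair_disagree = epistemic_pairwise(agree), epistemic_pairwise(disagree)
```

Under these assumptions, both estimates grow when the members disagree, and neither changes if every sample is shifted by the same constant, since the kernel depends only on differences.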
Abstract
Regression tasks, notably in safety-critical domains, require proper uncertainty quantification, yet the literature remains largely classification-focused. In this light, we introduce a family of measures for total, aleatoric, and epistemic uncertainty based on proper scoring rules, with a particular emphasis on kernel scores. The framework unifies several well-known measures and provides a principled recipe for designing new ones whose behavior, such as tail sensitivity, robustness, and out-of-distribution responsiveness, is governed by the choice of kernel. We prove explicit correspondences between kernel-score characteristics and downstream behavior, yielding concrete design guidelines for task-specific measures. Extensive experiments demonstrate that these measures are effective in downstream tasks and reveal clear trade-offs among instantiations, including robustness and out-of-distribution detection performance.
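As an illustration of how the choice of kernel shapes the resulting measures, here is a hedged sketch rather than the paper's code. It assumes the standard kernel-score entropy 0.5 E[rho(X, X')] for a negative-definite kernel rho, takes total uncertainty as the entropy of the pooled ensemble mixture, aleatoric uncertainty as the mean member entropy, and epistemic uncertainty as their difference. The names `kernel_entropy`, `decompose`, and the three `*_rho` functions are hypothetical, as is the default bandwidth.

```python
import numpy as np


def squared_rho(a, b):
    """Squared distance; the induced entropy is the variance."""
    return (a - b) ** 2


def energy_rho(a, b):
    """Absolute distance; corresponds to the energy score."""
    return np.abs(a - b)


def gaussian_rho(a, b, bandwidth=1.0):
    """One minus a Gaussian kernel; bounded in [0, 1)."""
    return 1.0 - np.exp(-((a - b) ** 2) / (2.0 * bandwidth**2))


def kernel_entropy(samples, rho):
    """Sample estimate of 0.5 * E[rho(X, X')] over all pairs of samples."""
    x = np.asarray(samples, dtype=float)
    return float(0.5 * rho(x[:, None], x[None, :]).mean())


def decompose(ensemble, rho):
    """Return (total, aleatoric, epistemic) for an (members, samples) array."""
    ensemble = np.asarray(ensemble, dtype=float)
    total = kernel_entropy(ensemble.ravel(), rho)  # entropy of pooled mixture
    aleatoric = float(np.mean([kernel_entropy(m, rho) for m in ensemble]))
    return total, aleatoric, total - aleatoric


# Toy ensemble: four members with different means and equal spread.
rng = np.random.default_rng(0)
member_means = np.array([-1.0, 0.0, 1.0, 2.0])
ensemble = member_means[:, None] + rng.normal(0.0, 0.5, size=(4, 100))

tu_sq, au_sq, eu_sq = decompose(ensemble, squared_rho)
tu_en, au_en, eu_en = decompose(ensemble, energy_rho)
tu_ga, au_ga, eu_ga = decompose(ensemble, gaussian_rho)
```

With the squared distance, this sketch reduces to the law of total variance: total uncertainty is the pooled variance, aleatoric uncertainty is the mean within-member variance, and epistemic uncertainty is the variance of the member means. The Gaussian variant is bounded by 0.5, which limits the influence of extreme samples, while the absolute and squared distances grow without bound.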
