How Certain are Uncertainty Estimates? Three Novel Earth Observation Datasets for Benchmarking Uncertainty Quantification in Machine Learning
Yuanyuan Wang, Qian Song, Dawood Wasif, Muhammad Shahzad, Christoph Koller, Jonathan Bamber, Xiao Xiang Zhu
TL;DR
The paper addresses the lack of reliable uncertainty quantification (UQ) benchmarks for Earth observation (EO) machine learning by introducing three datasets, RegressionUQ, SegmentationUQ, and ClassificationUQ, that provide ground-truth or label uncertainty so that UQ methods can be evaluated directly. Each dataset uses a principled reference uncertainty: a known physical model with Monte Carlo propagation for regression, a neural network with an entropy-based reference for segmentation, and distributional human labels with KL-based training for classification. The authors demonstrate baseline UQ methods (e.g., Bayesian networks, dropout, test-time augmentation (TTA), and distributional label learning) and show how dataset design (noise types, input/output distributions, and training data size) affects uncertainty estimation, calibration, and generalization. The work offers practical resources and insights to advance reliable UQ in EO products and downstream applications.
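As a rough illustration of the three kinds of reference uncertainty named above, the following minimal sketch shows Monte Carlo propagation through a known forward model, predictive entropy, and KL divergence against a distributional label. It is not the authors' implementation: the function names, the Gaussian input-noise assumption, and the toy linear model are all hypothetical, chosen only to make the ideas concrete.

```python
import numpy as np


def mc_propagate(model, x, input_std, n_samples=10_000, seed=0):
    """Propagate Gaussian input noise through a known forward model.

    Returns the Monte Carlo mean and standard deviation of the output.
    Gaussian input noise is an assumption made for this sketch.
    """
    rng = np.random.default_rng(seed)
    noisy = x + rng.normal(0.0, input_std, size=(n_samples,) + np.shape(x))
    y = model(noisy)
    return y.mean(axis=0), y.std(axis=0)


def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of class probabilities along the last axis."""
    p = np.clip(p, eps, 1.0)
    return -(p * np.log(p)).sum(axis=-1)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between a label distribution p and a prediction q."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return (p * (np.log(p) - np.log(q))).sum(axis=-1)


# Hypothetical stand-in for a physical model: y = 2x + 1.
def toy_model(x):
    return 2.0 * x + 1.0


# Regression: reference output uncertainty from propagated input noise.
ref_mean, ref_std = mc_propagate(toy_model, 3.0, input_std=0.5)

# Segmentation: per-pixel entropy of a softmax output (here a single pixel).
uniform_entropy = entropy(np.full(4, 0.25))

# Classification: mismatch between human label votes and a model prediction.
label_dist = np.array([0.5, 0.5])
pred_dist = np.array([0.9, 0.1])
kl = kl_divergence(label_dist, pred_dist)
```

For the linear toy model the propagated standard deviation should be close to twice the input standard deviation, which gives a simple sanity check before swapping in a more realistic forward model.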
Abstract
Uncertainty quantification (UQ) is essential for assessing the reliability of Earth observation (EO) products. However, the extensive use of machine learning models in EO introduces an additional layer of complexity, as those models themselves are inherently uncertain. While various UQ methods exist for machine learning models, their performance on EO datasets remains largely unevaluated. A key challenge in the community is the absence of ground truth for the uncertainty itself, i.e., a reference for how certain the uncertainty estimates are, in addition to the labels for the image or signal. This article fills this gap by introducing three benchmark datasets specifically designed for UQ in EO machine learning models. These datasets address three common problem types in EO: regression, image segmentation, and scene classification. They enable a transparent comparison of different UQ methods for EO machine learning models. We describe the creation and characteristics of each dataset, including data sources, preprocessing steps, and label generation, with a particular focus on calculating the reference uncertainty. We also showcase the baseline performance of several machine learning models on each dataset, highlighting the utility of these benchmarks for model development and comparison. Overall, this article offers a valuable resource for researchers and practitioners working in artificial intelligence for EO, promoting more accurate and reliable quality measures for the outputs of machine learning models. The dataset and code are accessible via https://gitlab.lrz.de/ai4eo/WG_Uncertainty.
