How Certain are Uncertainty Estimates? Three Novel Earth Observation Datasets for Benchmarking Uncertainty Quantification in Machine Learning

Yuanyuan Wang, Qian Song, Dawood Wasif, Muhammad Shahzad, Christoph Koller, Jonathan Bamber, Xiao Xiang Zhu

TL;DR

The paper addresses the lack of reliable UQ benchmarks for Earth Observation ML by introducing three novel datasets (RegressionUQ, SegmentationUQ, and ClassificationUQ) that provide ground-truth or label uncertainty, enabling direct evaluation of UQ methods. Each dataset has a principled reference uncertainty: a known physical model with Monte Carlo propagation for regression, a neural network with an entropy-based reference for segmentation, and distributional human labels with KL-based training for classification. The authors evaluate baseline UQ methods (e.g., Bayesian networks, dropout, test-time augmentation (TTA), and distributional label learning) and show how dataset design (noise types, input/output distributions, and training data size) affects uncertainty estimation, calibration, and generalization. The work provides datasets, code, and baseline results to support reliable uncertainty quantification in EO products and downstream applications.
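The three reference-uncertainty ideas named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function `toy_biomass`, its coefficients, the noise level, and every function name below are hypothetical and chosen only to make the sketch runnable. It shows Monte Carlo propagation of Gaussian input noise through a known model, the Shannon entropy of a class-probability vector, and the KL divergence between a label distribution and a model prediction.

```python
import math
import random
import statistics


def toy_biomass(height):
    """Hypothetical stand-in for a known physical model B = f(H)."""
    return 0.05 * height ** 2.5


def mc_reference_variance(model, x0, noise_std, n_samples=10_000, seed=0):
    """Estimate the output variance at x0 by sampling noisy inputs.

    Gaussian noise with standard deviation `noise_std` is added to x0,
    each noisy input is pushed through `model`, and the sample variance
    of the outputs is returned.
    """
    rng = random.Random(seed)
    outputs = [model(rng.gauss(x0, noise_std)) for _ in range(n_samples)]
    return statistics.variance(outputs)


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two categorical distributions.

    `eps` guards against log(0) when the prediction assigns zero mass
    to a class that the target distribution supports.
    """
    return sum(
        t * math.log(t / max(q, eps))
        for t, q in zip(target, predicted)
        if t > 0
    )


if __name__ == "__main__":
    # Assumed values: a 20 m tree height with 1 m measurement noise.
    print("reference variance:", mc_reference_variance(toy_biomass, 20.0, 1.0))
    # Assumed label votes from several annotators, normalized to sum to 1.
    human_labels = [0.6, 0.3, 0.1]
    model_output = [0.5, 0.4, 0.1]
    print("label entropy:", entropy(human_labels))
    print("KL(label || model):", kl_divergence(human_labels, model_output))
```

In this sketch the Monte Carlo variance plays the role of a reference for a regression output, while the entropy and KL functions operate on class-probability vectors such as per-pixel softmax outputs or normalized annotator votes.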

Abstract

Uncertainty quantification (UQ) is essential for assessing the reliability of Earth observation (EO) products. However, the extensive use of machine learning models in EO introduces an additional layer of complexity, as those models themselves are inherently uncertain. While various UQ methods do exist for machine learning models, their performance on EO datasets remains largely unevaluated. A key challenge in the community is the absence of the ground truth for uncertainty, i.e. how certain the uncertainty estimates are, apart from the labels for the image/signal. This article fills this gap by introducing three benchmark datasets specifically designed for UQ in EO machine learning models. These datasets address three common problem types in EO: regression, image segmentation, and scene classification. They enable a transparent comparison of different UQ methods for EO machine learning models. We describe the creation and characteristics of each dataset, including data sources, preprocessing steps, and label generation, with a particular focus on calculating the reference uncertainty. We also showcase baseline performance of several machine learning models on each dataset, highlighting the utility of these benchmarks for model development and comparison. Overall, this article offers a valuable resource for researchers and practitioners working in artificial intelligence for EO, promoting a more accurate and reliable quality measure of the outputs of machine learning models. The dataset and code are accessible via https://gitlab.lrz.de/ai4eo/WG_Uncertainty.

Paper Structure

This paper contains 25 sections, 6 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Error sources of predictive uncertainty in EO with a machine learning model: (1) the noise in the observations $\mathbf{x}$, (2) the structural uncertainty in the model $F$, e.g. its network architecture, and (3) the noise in the training data $\mathbf{D}$.
  • Figure 2: Illustration of the checkerboard training and test set split strategy on the SegmentationUQ dataset. The dots in the 2D plot are the tree diameter and height samples, drawn from gamma distributions fitted to the Chave dataset. Dark green and light green represent the training and test sets, respectively.
  • Figure 3: The calculation of the reference variance of the biomass prediction, illustrated with a one-dimensional model $B=f(H)$. With a defined physical model, the noise at $H_0$, shown as the black Gaussian curve below the x-axis, is transformed by the equation into the black Gaussian curve along the y-axis. In contrast, for a data-driven model trained on a wide range of data, shown as the combined distribution of black and green Gaussians below the x-axis, the output uncertainty is also a mixture of all the distributions.
  • Figure 4: Histograms of correlation coefficients between estimated aleatoric uncertainty and GT data uncertainty at pixel level using different UQ methods.
  • Figure 5: A collection of sample variations of the dataset under three noise types: 3D viewpoint variation (3D VPV), Gaussian noise, and Poisson noise. Each sample is subjected to a specific noise intensity, set by the standard deviation for Gaussian noise and by lambda for Poisson noise, at incremental levels of 1, 2, 4, and 8. The arrangement shows how each noise type and intensity alters the imagery.
  • ...and 1 more figure
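The checkerboard split described in the Figure 2 caption can be pictured with a short sketch. It rests on assumptions that are not taken from the paper: square cells of a hypothetical size, cells assigned to training or test by the parity of their row and column indices, and made-up gamma parameters for the diameter and height samples. The function name `checkerboard_split` is likewise illustrative.

```python
import random


def checkerboard_split(samples, cell_size):
    """Split 2D samples into train/test by alternating grid cells.

    Each (diameter, height) pair falls into a grid cell of side
    `cell_size`. Cells whose row + column index is even go to the
    training set and the rest go to the test set, which produces a
    checkerboard pattern in the 2D input space.
    """
    train, test = [], []
    for diameter, height in samples:
        col = int(diameter // cell_size)
        row = int(height // cell_size)
        if (row + col) % 2 == 0:
            train.append((diameter, height))
        else:
            test.append((diameter, height))
    return train, test


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical gamma shape/scale values, not the fitted parameters.
    samples = [
        (rng.gammavariate(2.0, 10.0), rng.gammavariate(4.0, 5.0))
        for _ in range(1000)
    ]
    train, test = checkerboard_split(samples, cell_size=5.0)
    print(len(train), "training samples,", len(test), "test samples")
```

A split of this kind keeps training and test samples interleaved across the whole input range while still holding out contiguous regions that the model never sees during training.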