CAAL: Confidence-Aware Active Learning for Heteroscedastic Atmospheric Regression

Fei Jiang, Jiyang Xia, Junjie Yu, Mingfei Sun, Hugh Coe, David Topping, Dantong Liu, Zhenhui Jessie Li, Zhonghua Zheng

TL;DR

This work addresses the inference of hard-to-measure atmospheric particle properties from routine observations under heteroscedastic noise. It introduces CAAL, a Confidence-Aware Active Learning framework that combines mean-variance decoupled training with a reliability-guided acquisition function, which weights epistemic uncertainty by predicted aleatoric noise. Experiments on particle-resolved simulations and real observations show that CAAL reaches higher predictive performance with substantially fewer labeled samples, particularly in highly heterogeneous regions, and remains stable across settings. The approach offers a practical, broadly applicable route to scalable atmospheric particle property databases for improved health and climate impact assessments.

Abstract

Quantifying the impacts of air pollution on health and climate relies on key atmospheric particle properties such as toxicity and hygroscopicity. However, these properties typically require complex observational techniques or expensive particle-resolved numerical simulations, limiting the availability of labeled data. We therefore estimate these hard-to-measure particle properties from routinely available observations (e.g., air pollutant concentrations and meteorological conditions). Because routine observations only indirectly reflect particle composition and structure, the mapping from routine observations to particle properties is noisy and input-dependent, yielding a heteroscedastic regression setting. With a limited and costly labeling budget, the central challenge is to select which samples to measure or simulate. While active learning is a natural approach, most acquisition strategies rely on predictive uncertainty. Under heteroscedastic noise, this signal conflates reducible epistemic uncertainty with irreducible aleatoric uncertainty, causing limited budgets to be wasted in noise-dominated regions. To address this challenge, we propose a confidence-aware active learning framework (CAAL) for efficient and robust sample selection in heteroscedastic settings. CAAL consists of two components: a decoupled uncertainty-aware training objective that separately optimises the predictive mean and noise level to stabilise uncertainty estimation, and a confidence-aware acquisition function that dynamically weights epistemic uncertainty using predicted aleatoric uncertainty as a reliability signal. Experiments on particle-resolved numerical simulations and real atmospheric observations show that CAAL consistently outperforms standard AL baselines. The proposed framework provides a practical and general solution for the efficient expansion of high-cost atmospheric particle property databases.
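
The acquisition idea in the abstract, weighting epistemic uncertainty by predicted aleatoric uncertainty, can be illustrated with a minimal sketch. The abstract does not give the exact functional form, so everything below is an assumption made for illustration: an ensemble of mean-variance regressors is taken as the source of both uncertainties, the exponential down-weighting `exp(-beta * aleatoric / scale)` is a hypothetical choice, and the names `decompose_uncertainty`, `confidence_weighted_scores`, `select_batch`, and `beta` are invented here rather than taken from the paper.

```python
import numpy as np


def decompose_uncertainty(means, variances):
    """Split ensemble predictions into epistemic and aleatoric parts.

    means, variances: arrays of shape (n_members, n_candidates), holding each
    ensemble member's predicted mean and predicted noise variance.
    """
    epistemic = means.var(axis=0)       # disagreement between members
    aleatoric = variances.mean(axis=0)  # average predicted noise level
    return epistemic, aleatoric


def confidence_weighted_scores(epistemic, aleatoric, beta=1.0):
    """Down-weight epistemic uncertainty where predicted noise is high.

    ASSUMPTION: the exponential weight and the normalisation by the mean
    aleatoric level are illustrative choices, not the paper's formula.
    beta = 0 recovers plain epistemic-uncertainty sampling.
    """
    scale = aleatoric.mean() + 1e-12
    weight = np.exp(-beta * aleatoric / scale)
    return epistemic * weight


def select_batch(scores, k):
    """Return the indices of the k highest-scoring candidates."""
    return np.argsort(scores)[::-1][:k]


if __name__ == "__main__":
    # Two ensemble members, three candidate samples (toy numbers).
    means = np.array([[0.0, 0.0, 1.0],
                      [2.0, 2.0, 1.0]])
    variances = np.array([[0.1, 10.0, 0.1],
                          [0.1, 10.0, 0.1]])
    epi, ale = decompose_uncertainty(means, variances)
    scores = confidence_weighted_scores(epi, ale, beta=1.0)
    print(select_batch(scores, k=1))
```

In this toy setup the first two candidates have identical epistemic uncertainty, but the second lies in a noise-dominated region, so the weighted score favours the first. That is the behaviour the abstract motivates: the labeling budget is steered away from regions where uncertainty is mostly irreducible.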

Paper Structure

This paper contains 36 sections, 36 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Overview of the Confidence-Aware Active Learning (CAAL) framework.
  • Figure 2: $R^2$ for $\chi_a$ on the PartMC dataset versus the AL query budget, evaluated at the scenario level.
  • Figure 3: Mean epistemic uncertainty (a) and aleatoric uncertainty (b), averaged over the queried samples at each AL round on the PartMC dataset.
  • Figure 4: $R^2$ for $\chi_a$ on the PartMC dataset under CAAL with different $\beta$ values, evaluated at the scenario level.
  • Figure 5: Comparison of training objectives under CAAL for predicting $\chi_a$ on the PartMC dataset.
  • ...and 4 more figures