Generative Modeling of Aerosol State Representations
Ehsan Saleh, Saba Ghaffari, Jeffrey H. Curtis, Lekha Patel, Peter A. Bosler, Nicole Riemer, Matthew West
TL;DR
This paper tackles the high dimensionality of aerosol state representations by learning compact, physically meaningful latent representations using a variational autoencoder (VAE). It demonstrates that hundreds of input features corresponding to speciated mass and number distributions can be compressed to $10$ latent variables while preserving important diagnostics such as CCN spectra, optical properties, and ice nucleation. A noise-resilient preprocessing strategy and a novel realism metric based on sliced Wasserstein distance are introduced to improve robustness and realism of generated aerosols, enabling surrogate modeling for climate applications. The work provides a path toward efficient, scalable aerosol representations and outlines future directions for time-evolving surrogates and diagnostic-weighted training to further enhance ice nucleation predictions.
Abstract
Aerosol--cloud--radiation interactions remain among the most uncertain components of the Earth's climate system, in part due to the high dimensionality of aerosol state representations and the difficulty of obtaining complete \textit{in situ} measurements. Addressing these challenges requires methods that distill complex aerosol properties into compact yet physically meaningful forms. Generative autoencoder models provide such a pathway. We present a framework for learning deep variational autoencoder (VAE) models of speciated mass and number concentration distributions, which capture detailed aerosol size-composition characteristics. By compressing hundreds of original dimensions into ten latent variables, the approach enables efficient storage and processing while preserving the fidelity of key diagnostics, including cloud condensation nuclei (CCN) spectra, optical scattering and absorption coefficients, and ice nucleation properties. Results show that CCN spectra are easiest to reconstruct accurately, optical properties are moderately difficult, and ice nucleation properties are the most challenging. To improve performance, we introduce a preprocessing optimization strategy that avoids repeated retraining and yields latent representations resilient to high-magnitude Gaussian noise, boosting accuracy for CCN spectra, optical coefficients, and frozen fraction spectra. Finally, we propose a novel realism metric -- based on the sliced Wasserstein distance between generated samples and a held-out test set -- for optimizing the KL divergence weight in VAEs. Together, these contributions enable compact, robust, and physically meaningful representations of aerosol states for large-scale climate applications.
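To make the realism metric concrete, the following is a minimal sketch of a Monte Carlo sliced Wasserstein estimate between a batch of generated samples and a held-out test batch. It is an illustration under stated assumptions, not the paper's implementation: the function name `sliced_wasserstein`, the choice of the 1-Wasserstein order, the number of random projections, and the requirement of equal sample counts are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def sliced_wasserstein(generated, held_out, n_projections=128, seed=0):
    """Monte Carlo estimate of the sliced 1-Wasserstein distance.

    Assumes both inputs are arrays of shape (n_samples, n_features)
    with the same number of samples, so that the 1D Wasserstein
    distance reduces to the mean absolute gap between sorted values.
    """
    generated = np.asarray(generated, dtype=float)
    held_out = np.asarray(held_out, dtype=float)
    if generated.shape != held_out.shape:
        raise ValueError("this sketch assumes equally shaped sample arrays")

    rng = np.random.default_rng(seed)
    n_features = generated.shape[1]

    # Draw random directions uniformly on the unit sphere.
    directions = rng.normal(size=(n_projections, n_features))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    # Project both sample sets onto every direction, then sort each
    # 1D projection so matching ranks can be compared directly.
    gen_proj = np.sort(generated @ directions.T, axis=0)
    ref_proj = np.sort(held_out @ directions.T, axis=0)

    # Average the 1D distances over samples and over projections.
    return float(np.mean(np.abs(gen_proj - ref_proj)))
```

Under these assumptions, a lower value means the generated samples are distributed more like the held-out set. One plausible use, consistent with the abstract, is to evaluate this score for VAEs trained with several candidate KL divergence weights and keep the weight that gives the smallest distance.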
