Creating and Leveraging a Synthetic Dataset of Cloud Optical Thickness Measures for Cloud Detection in MSI

Aleksis Pirinen, Nosheen Abid, Nuria Agues Paszkowsky, Thomas Ohlson Timoudas, Ronald Scheirer, Chiara Ceccobello, György Kovács, Anders Persson

TL;DR

A novel synthetic dataset for cloud optical thickness (COT) estimation is proposed and subsequently leveraged to obtain reliable and versatile cloud masks on real data, as demonstrated on two satellite image datasets.

Abstract

Cloud formations often obscure optical satellite-based monitoring of the Earth's surface, thus limiting Earth observation (EO) activities such as land cover mapping, ocean color analysis, and cropland monitoring. The integration of machine learning (ML) methods within the remote sensing domain has significantly improved performance on a wide range of EO tasks, including cloud detection and filtering, but there is still much room for improvement. A key bottleneck is that ML methods typically depend on large amounts of annotated data for training, which is often difficult to come by in EO contexts. This is especially true when it comes to cloud optical thickness (COT) estimation. A reliable estimation of COT enables more fine-grained and application-dependent control compared to using pre-specified cloud categories, as is commonly done in practice. To alleviate the COT data scarcity problem, in this work we propose a novel synthetic dataset for COT estimation, which we subsequently leverage to obtain reliable and versatile cloud masks on real data. In our dataset, top-of-atmosphere radiances have been simulated for 12 of the spectral bands of the MultiSpectral Instrument (MSI) sensor onboard Sentinel-2 platforms. These data points have been simulated under consideration of different cloud types, COTs, and ground surface and atmospheric profiles. Extensive experiments, in which several ML models are trained to predict COT from the measured reflectivity of the spectral bands, demonstrate the usefulness of our proposed dataset. In particular, by thresholding COT estimates from our ML models, we show on two satellite image datasets (one that is publicly available, and one that we have collected and annotated) that reliable cloud masks can be obtained. The synthetic data, the collected real dataset, code and models have been made publicly available at https://github.com/aleksispi/ml-cloud-opt-thick.
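
The thresholding step mentioned in the abstract can be illustrated with a minimal, dependency-free sketch. It assumes per-pixel COT estimates are already available as a nested list. The function names, the three class labels, and the two cut-off values are hypothetical placeholders chosen for illustration; they are not taken from the paper or its repository.

```python
from typing import List, Sequence

# Hypothetical class labels for illustration only.
CLEAR, SEMI_TRANSPARENT, OPAQUE = 0, 1, 2


def classify_cot(cot: float, thin_threshold: float, thick_threshold: float) -> int:
    """Map a single COT estimate to a cloud class using two cut-offs."""
    if cot < thin_threshold:
        return CLEAR
    if cot < thick_threshold:
        return SEMI_TRANSPARENT
    return OPAQUE


def cloud_mask(
    cot_map: Sequence[Sequence[float]],
    thin_threshold: float = 1.0,   # placeholder value, not from the paper
    thick_threshold: float = 5.0,  # placeholder value, not from the paper
) -> List[List[int]]:
    """Threshold a 2-D map of COT estimates into a per-pixel class mask."""
    if thin_threshold > thick_threshold:
        raise ValueError("thin_threshold must not exceed thick_threshold")
    return [
        [classify_cot(value, thin_threshold, thick_threshold) for value in row]
        for row in cot_map
    ]


# Tiny example: a 2x3 map of COT estimates.
example_cot = [[0.0, 0.4, 2.5], [7.0, 30.0, 0.9]]
example_mask = cloud_mask(example_cot)
```

Because the cut-offs are ordinary parameters, the same COT map can yield a stricter or more permissive mask depending on the application, which is the kind of application-dependent control the abstract contrasts with fixed cloud categories.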

Paper Structure (11 sections, 2 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Distribution of COTs in the range $[0,50]$ for different cloud types in our dataset: water clouds (left), ice clouds (center), and combined water and ice clouds (right). The focus of our data is on optically thin clouds.
  • Figure 2: Main MLP model architecture that we use for COT estimation. The input layer is shown on the left (4 nodes shown for the input $\boldsymbol{x}$ instead of all 12, to avoid visual clutter). Then follow four hidden layers (8 nodes shown for each, instead of 64). Finally, the output layer on the right produces the COT estimate $\hat{y}$. Each of the five layers is of the form $\boldsymbol{z}^{k} = \mathrm{ReLU}(\boldsymbol{W}^k\boldsymbol{z}^{k-1} + \boldsymbol{b}^k) = \max(\boldsymbol{0}, \boldsymbol{W}^k\boldsymbol{z}^{k-1} + \boldsymbol{b}^k)$ for $k=1,\dots,5$, with $\boldsymbol{z}^0 = \boldsymbol{x}$ and $\boldsymbol{z}^5=\hat{y}$. Here, $\boldsymbol{W}^k$ and $\boldsymbol{b}^k$ respectively represent the weight matrix and bias vector of the $k$-th layer, and these parameters are calibrated through the training process (see the training section). Note that we use an MLP, rather than e.g. a convolutional architecture, since the data points $\boldsymbol{x}_i$ are independent of each other in our synthetic dataset.
  • Figure 3: Left: Examples of our main MLP approach (ensemble of ten 5-layer MLPs) on unseen KappaZeta test data. Column 1: Input image (only RGB is shown). Column 2: COT estimates (relative intensity scaling, to more clearly see variations). Column 3: Pixel-level cloud type predictions based on thresholding the COTs in column 2. Column 4: KappaZeta ground truth. Dark blue is clear sky, lighter blue is semi-transparent cloud, and turquoise is opaque cloud. Right: Similar to the left, but the 2nd column shows the thresholded model predictions (instead of the COT estimates), and the 3rd column is the U-net prediction. A failure case for both models is shown on the bottom row.
  • Figure 4: Eight additional qualitative examples on the KappaZeta test set (examples are shown in the same format on the left and right side of this figure). In each example, columns 1 and 4 are the same as in Figure 3, while column 2 shows pixel-level cloud type predictions based on thresholding the COTs of an MLP ensemble approach that was trained only on our synthetic data. Column 3 is the same as column 2, but the MLPs were refined on KappaZeta training data. Fine-tuning sometimes yields better results (e.g. top two rows on the left). In many cases, however, the results are very similar before and after fine-tuning (e.g. third row on the left and right side), and sometimes results get worse after fine-tuning (e.g. fourth row on the left).
  • Figure 5: Locations in Sweden of the annotated imagery provided by the SFA. Larger dots indicate a higher density of images in the associated region.
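
The layer recursion in the Figure 2 caption, $\boldsymbol{z}^{k} = \mathrm{ReLU}(\boldsymbol{W}^k\boldsymbol{z}^{k-1} + \boldsymbol{b}^k)$ for $k=1,\dots,5$, can be written out as a short forward pass. The sketch below is a plain-Python illustration and not the authors' implementation: the layer widths (12 inputs, four hidden layers of 64 nodes, one output) follow the caption, while the function names, the random initialization, and its scaling are assumptions made only so that the example runs. Training is not shown.

```python
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Layer = Tuple[Matrix, Vector]

# Widths from the Figure 2 caption: 12 band inputs, four hidden layers
# of 64 nodes each, and a single COT output.
LAYER_SIZES = [12, 64, 64, 64, 64, 1]


def init_params(sizes: List[int], seed: int = 0) -> List[Layer]:
    """Create random weights and zero biases (initialization scheme is an assumption)."""
    rng = random.Random(seed)
    params: List[Layer] = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = (2.0 / n_in) ** 0.5  # He-style scaling, chosen for illustration
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def relu_layer(weights: Matrix, biases: Vector, z: Vector) -> Vector:
    """Compute max(0, W z + b) element-wise for one layer."""
    if len(z) != len(weights[0]):
        raise ValueError("input length does not match the weight matrix")
    return [
        max(0.0, sum(w * v for w, v in zip(row, z)) + b)
        for row, b in zip(weights, biases)
    ]


def predict_cot(params: List[Layer], x: Vector) -> float:
    """Apply all five ReLU layers in turn; the final layer has a single output node."""
    z = x
    for weights, biases in params:
        z = relu_layer(weights, biases, z)
    return z[0]


# Example with an arbitrary 12-band input (values are placeholders, not real data).
params = init_params(LAYER_SIZES)
cot_estimate = predict_cot(params, [0.1] * 12)
```

Since the output layer also passes through ReLU, the estimate is non-negative by construction. Each input vector is processed on its own, which matches the caption's remark that the synthetic data points are independent of each other.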