Improving Satellite Imagery Masking using Multi-task and Transfer Learning

Rangel Daroya, Luisa Vieira Lucchese, Travis Simmons, Punwath Prum, Tamlin Pavelsky, John Gardner, Colin J. Gleason, Subhransu Maji

TL;DR

This work tackles the challenge of masking satellite imagery for downstream suspended sediment concentration (SSC) estimation by predicting all required masks simultaneously from Harmonized Landsat and Sentinel (HLS) imagery. It introduces a multi-task deep learning framework with a shared backbone and per-mask heads, trained with transfer learning from large pre-training datasets, and compares CNN and transformer architectures. The approach yields a 9% F1 gain on water masking, a 30× speedup in the SSC pipeline, and a 2.64 mg/L reduction in SSC estimation error, while reducing memory and storage demands. The results indicate that end-to-end, multi-task masking enables efficient and more accurate surface water analysis at global scale, and they offer practical guidance on model choice and training strategy for operational deployment.
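The "shared backbone and per-mask heads" idea can be illustrated with a toy sketch. This is not the paper's network: `shared_backbone`, `MultiTaskMasker`, the two hand-made features, and the untrained placeholder weights are all assumptions made for illustration, and the five mask names are taken from the Figure 3 caption plus the water mask. The only point it shows is that the shared features are computed once and reused by every head.

```python
import math
from typing import Dict, List, Tuple

# Mask names assumed from the Figure 3 caption, plus the water mask.
MASK_NAMES = ["water", "cloud", "cloud_shadow", "snow_ice", "terrain_shadow"]


def shared_backbone(pixel: List[float]) -> List[float]:
    """Hypothetical stand-in for a learned backbone: two hand-made features."""
    mean = sum(pixel) / len(pixel)
    spread = max(pixel) - min(pixel)
    return [mean, spread]


class MultiTaskMasker:
    """One shared feature extractor feeding one small linear head per mask."""

    def __init__(self, heads: Dict[str, Tuple[List[float], float]]):
        self.heads = heads        # mask name -> (weights, bias)
        self.backbone_calls = 0   # how many times shared features were computed

    def predict(self, pixel: List[float]) -> Dict[str, bool]:
        self.backbone_calls += 1
        features = shared_backbone(pixel)  # computed once, reused by every head
        masks = {}
        for name, (weights, bias) in self.heads.items():
            logit = sum(w * f for w, f in zip(weights, features)) + bias
            probability = 1.0 / (1.0 + math.exp(-logit))
            masks[name] = probability > 0.5
        return masks


# Untrained placeholder weights, for illustration only.
heads = {name: ([1.0, -1.0], 0.0) for name in MASK_NAMES}
model = MultiTaskMasker(heads)
prediction = model.predict([0.05, 0.08, 0.04, 0.02])  # made-up band values
```

Running five independent single-task models would repeat the feature extraction five times per pixel, whereas here `model.backbone_calls` increases by one per prediction regardless of the number of heads.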

Abstract

Many remote sensing applications employ masking of pixels in satellite imagery for subsequent measurements. For example, estimating water quality variables, such as Suspended Sediment Concentration (SSC), requires isolating pixels depicting water bodies unaffected by clouds, their shadows, terrain shadows, and snow and ice formation. A significant bottleneck is the reliance on a variety of data products (e.g., satellite imagery, elevation maps), and a lack of precision in individual steps affects estimation accuracy. We propose to improve both the accuracy and computational efficiency of masking by developing a system that predicts all required masks from Harmonized Landsat and Sentinel (HLS) imagery. Our model employs multi-tasking to share computation and enable higher accuracy across tasks. We experiment with recent advances in deep network architectures and show that masking models can benefit from these, especially when combined with pre-training on large satellite imagery datasets. We present a collection of models offering different speed/accuracy trade-offs for masking. MobileNet variants are the fastest and perform competitively with larger architectures. Transformer-based architectures are the slowest, but benefit the most from pre-training on large satellite imagery datasets. Our models provide a 9% F1 score improvement compared to previous work on water pixel identification. When integrated with an SSC estimation system, our models result in a 30× speedup while reducing estimation error by 2.64 mg/L, allowing for global-scale analysis. We also evaluate our model on a recently proposed cloud and cloud shadow estimation benchmark, where we outperform the current state-of-the-art model by at least 6% in F1 score.
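Since the headline results are reported as F1 scores, a minimal sketch of the per-pixel F1 computation for a binary mask may help. The function name `f1_score`, the flattened 0/1 list representation, and the choice to return 0.0 when there are no positive pixels in either mask are assumptions for illustration; the paper's exact evaluation protocol is not described in this section.

```python
from typing import Sequence


def f1_score(truth: Sequence[int], pred: Sequence[int]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN) over flattened binary masks (1 = mask present)."""
    if len(truth) != len(pred):
        raise ValueError("masks must contain the same number of pixels")
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    # Assumption: define F1 as 0.0 when neither mask contains a positive pixel.
    return 2 * tp / denominator if denominator else 0.0


truth_mask = [1, 1, 0, 0]
predicted_mask = [1, 0, 1, 0]
score = f1_score(truth_mask, predicted_mask)  # TP=1, FP=1, FN=1 -> 0.5
```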

Paper Structure

This paper contains 34 sections, 14 equations, 9 figures, and 10 tables.

Figures (9)

  • Figure 1: Geographic distribution of train, validation, and test data. The dataset has global coverage, with train, validation, and test splits spanning different locations. Each dot represents the center of a sampled tile. Gaps are present to prevent overlap between data splits, since each tile covers 109.8 km × 109.8 km. Following the computer science convention, the training set is used for updating model weights during training, the validation set is used for selecting hyperparameters, and the test set is not seen by the model except when evaluating performance.
  • Figure 2: Comparison of multiple single-task models and a multi-task model. (a) Evaluating a separate model for each output is resource intensive, since it requires running five separate models. (b) A multi-task setup in which a single model predicts all five outputs at the same time, using approximately one-fifth of the resources for training.
  • Figure 3: Pipeline for estimating suspended sediment concentration (SSC). (a) Standard SSC pipelines involve multiple inputs and several processing steps, which contribute to the memory and runtime requirements. (b) Our proposed pipeline uses only readily available Harmonized Landsat and Sentinel (HLS) satellite images and estimates all masks faster by using a single multi-task model. Selecting good-quality water pixels by masking cloud, cloud shadow, snow/ice, and terrain shadow results in significantly better SSC estimates.
  • Figure 4: DeepLabv3+ multi-tasking model results on three samples from the OPERA DSWx test set. The RGB images are shown together with the corresponding ground truth and predicted masks. White pixels denote the presence of the mask and black pixels its absence. The model predictions across different types of masks closely match the DSWx-based ground truth.
  • Figure 5: Water masking results of different methods. MNDWI results shown here were filtered to remove clouds and shadows, while results from other methods were not filtered in any way. MNDWI fails to identify several water pixels, while DWM tends to predict smooth water boundaries, which can miss details. DeepLabv3+, MobileNetv3, and Swin-T outperform the baselines, but DeepLabv3+ is more robust to noise (see row four in the figure).
  • ...and 4 more figures
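
For readers unfamiliar with the MNDWI baseline in the Figure 5 caption, the sketch below computes the standard index, (green − SWIR) / (green + SWIR), and then discards pixels flagged as cloud or shadow, mirroring the filtering the caption describes. The threshold of 0, the handling of a zero denominator, the flattened-list layout, and the names `mndwi` and `filtered_water_mask` are assumptions for illustration, not the settings used in the paper.

```python
from typing import List, Sequence


def mndwi(green: float, swir: float) -> float:
    """Modified Normalized Difference Water Index for one pixel."""
    total = green + swir
    if total == 0:
        return 0.0  # assumption: treat an all-zero pixel as neutral
    return (green - swir) / total


def filtered_water_mask(
    green_band: Sequence[float],
    swir_band: Sequence[float],
    invalid: Sequence[bool],
    threshold: float = 0.0,  # assumption: a common default, not the paper's value
) -> List[bool]:
    """Water where MNDWI exceeds the threshold and the pixel is not cloud/shadow."""
    return [
        (not bad) and mndwi(g, s) > threshold
        for g, s, bad in zip(green_band, swir_band, invalid)
    ]


# Made-up reflectances: open water, bare land, and water hidden by cloud.
green = [0.08, 0.05, 0.09]
swir = [0.02, 0.20, 0.01]
cloud_or_shadow = [False, False, True]
water = filtered_water_mask(green, swir, cloud_or_shadow)
```

With these made-up values, only the first pixel is kept as water: the second has a negative index, and the third is removed by the cloud/shadow filter even though its index is positive.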