
Seeing Through the Clouds: Cloud Gap Imputation with Prithvi Foundation Model

Denys Godwin, Hanxi Li, Michael Cecil, Hamed Alemohammad

TL;DR

The study tackles cloud-induced gaps in multispectral time series and compares a Geospatial Foundation Model (Prithvi ViT) against a CGAN baseline for cloud-gap imputation. It shows that a pretrained ViT, fine-tuned with real cloud masks, achieves better $MAE$ and $SSIM$ than the CGAN across realistic masking schemes, even with limited fine-tuning. In both masking regimes, the Prithvi approach outperforms the CGAN, reaching a low $MAE$ (around $0.03$ zero-shot under extensive masking) and robust spatial-temporal consistency. The findings highlight the practicality of GFM-based cloud-gap imputation for producing gap-free time-series data and supporting downstream tasks such as land-use monitoring and crop yield estimation, with future work exploring additional data modalities such as DEMs and land-cover layers.

Abstract

Filling cloudy pixels in multispectral satellite imagery is essential for accurate data analysis and downstream applications, especially for tasks which require time series data. To address this issue, we compare the performance of a foundational Vision Transformer (ViT) model with a baseline Conditional Generative Adversarial Network (CGAN) model for missing value imputation in time series of multispectral satellite imagery. We randomly mask time series of satellite images using real-world cloud masks and train each model to reconstruct the missing pixels. The ViT model is fine-tuned from a pretrained model, while the CGAN is trained from scratch. Using quantitative evaluation metrics such as structural similarity index and mean absolute error as well as qualitative visual analysis, we assess imputation accuracy and contextual preservation.
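The masking-and-scoring setup described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code: the array layout `(T, C, H, W)`, the 3 time steps, the 6 bands, the fill value of 0, and the choice to compute MAE over the masked pixels only are all assumptions made for illustration, and the helper names `apply_cloud_mask` and `masked_mae` are hypothetical.

```python
import numpy as np


def apply_cloud_mask(series, cloud_mask, fill_value=0.0):
    """Blank out cloudy pixels in a time series.

    series:     float array of shape (T, C, H, W) (assumed layout)
    cloud_mask: bool array of shape (T, H, W), True where cloudy
    """
    masked = series.copy()
    # Broadcast the per-timestep mask across the channel axis.
    mask_b = np.broadcast_to(cloud_mask[:, None, :, :], series.shape)
    masked[mask_b] = fill_value
    return masked


def masked_mae(prediction, target, cloud_mask):
    """Mean absolute error over masked (imputed) pixels only.

    Restricting the error to masked pixels is an assumption of this
    sketch, not something the source states.
    """
    mask_b = np.broadcast_to(cloud_mask[:, None, :, :], target.shape)
    if not mask_b.any():
        return float("nan")  # nothing was masked, so nothing to score
    return float(np.abs(prediction[mask_b] - target[mask_b]).mean())


# Synthetic stand-in for a 3-scene, 6-band chip (illustrative sizes).
rng = np.random.default_rng(0)
series = rng.random((3, 6, 8, 8)).astype(np.float32)

# Mask a square in the middle scene only.
cloud_mask = np.zeros((3, 8, 8), dtype=bool)
cloud_mask[1, 2:6, 2:6] = True

masked_input = apply_cloud_mask(series, cloud_mask)

# Naive baseline: fill the gap with the same pixels from the first scene.
baseline = masked_input.copy()
baseline[1][:, cloud_mask[1]] = series[0][:, cloud_mask[1]]

print("baseline MAE on masked pixels:", masked_mae(baseline, series, cloud_mask))
```

A trained model's output would take the place of `baseline` here; the copy-forward fill only gives a simple reference point for comparing imputation errors.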

Paper Structure

This paper contains 9 sections, 1 equation, 19 figures, and 2 tables.

Figures (19)

  • Figure 1: Training the CGAN to impute cloudy pixels was accomplished by masking out clouds from input data, using this as the condition on which to generate, then comparing the generated data against the unmasked ground truth.
  • Figure 2: Reconstruction of a high-coverage image using CGAN and Prithvi, both trained using 6,231 images from E1 experiments (applying mask to the middle scene).
  • Figure A.3: MAE and SSIM for best epoch of 200 for E1 experiments (applying mask to the middle scene). Best epoch is the best performance out of 5 runs for all experiments across all epochs.
  • Figure A.4: MAE and SSIM for best epoch of 200 for E2 experiments (applying masks in all time steps). Best epoch is the best performance out of 5 runs for all experiments across all epochs.
  • Figure A.5: Relationships between cloud cover, time gap, SSIM, and MAE of validation chips for best results of E1 experiments using the full dataset to fine-tune Prithvi. Statistics are calculated for each validation chip.
  • ...and 14 more figures