Temporal-Spatial Tubelet Embedding for Cloud-Robust MSI Reconstruction using MSI-SAR Fusion: A Multi-Head Self-Attention Video Vision Transformer Approach

Yiqun Wang, Lujun Li, Meiru Yue, Radu State

TL;DR

The paper tackles cloud-induced data gaps in time-series multispectral imagery by introducing a ViViT-based framework that uses temporal-spatial tubelet embeddings with a constrained temporal span (t=2) to preserve local spectral dynamics. It fuses MSI with SAR data to achieve all-weather reconstruction and employs a three-component architecture (3D tubelet embedding, joint temporal-spatial MHSA, and linear patch decoding) optimized with a multi-scale MSE+SAM loss. On Traill County data, the proposed SMTS-ViViT approach consistently outperforms MSI-only and standard SAR-MSI baselines across MSE, SAM, PSNR, and SSIM, with SAR fusion yielding notable gains, especially under higher cloud cover. The work demonstrates a practical, robust strategy for agricultural monitoring under cloudy conditions and lays groundwork for flexible cross-temporal analysis with all-weather fusion.
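To make the MSE+SAM objective mentioned above concrete, the following is a minimal single-scale sketch in plain Python. It is not the paper's implementation: the weighting factor `lam`, the function names, and the per-pixel list layout are assumptions made for illustration, and the multi-scale aggregation is omitted because this summary does not specify it. SAM here is the standard spectral angle, the arccosine of the normalized dot product between a predicted and a target spectrum.

```python
import math
from typing import List

Spectra = List[List[float]]  # one spectrum (list of band values) per pixel


def mse(pred: Spectra, target: Spectra) -> float:
    """Mean squared error over all pixels and bands."""
    n = sum(len(p) for p in pred)
    return sum((a - b) ** 2 for p, t in zip(pred, target) for a, b in zip(p, t)) / n


def sam(pred: Spectra, target: Spectra) -> float:
    """Mean spectral angle (radians) between predicted and target spectra."""
    angles = []
    for p, t in zip(pred, target):
        dot = sum(a * b for a, b in zip(p, t))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
        if norm == 0.0:
            # Assumption: all-zero spectra (e.g. masked pixels) contribute no angle.
            angles.append(0.0)
            continue
        # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
        cos = max(-1.0, min(1.0, dot / norm))
        angles.append(math.acos(cos))
    return sum(angles) / len(angles)


def mse_sam_loss(pred: Spectra, target: Spectra, lam: float = 0.1) -> float:
    """Hypothetical combined loss: MSE + lam * SAM (lam is an assumed weight)."""
    return mse(pred, target) + lam * sam(pred, target)


# Toy check: two orthogonal two-band spectra.
pred = [[1.0, 0.0]]
target = [[0.0, 1.0]]
loss = mse_sam_loss(pred, target, lam=0.1)
```

The MSE term penalizes per-band magnitude errors, while the SAM term penalizes errors in spectral shape regardless of brightness, which is one plausible reason to combine the two for spectral reconstruction.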

Abstract

Cloud cover in multispectral imagery (MSI) significantly hinders early-season crop mapping by corrupting spectral information. Existing Vision Transformer (ViT)-based time-series reconstruction methods, like SMTS-ViT, often employ coarse temporal embeddings that aggregate entire sequences, causing substantial information loss and reducing reconstruction accuracy. To address these limitations, a Video Vision Transformer (ViViT)-based framework with temporal-spatial fusion embedding for MSI reconstruction in cloud-covered regions is proposed in this study. Non-overlapping tubelets are extracted via 3D convolution with a constrained temporal span $(t=2)$, ensuring local temporal coherence while reducing cross-day information degradation. Both MSI-only and SAR-MSI fusion scenarios are considered in the experiments. Comprehensive experiments on 2020 Traill County data demonstrate notable performance improvements: MTS-ViViT achieves a 2.23% reduction in MSE compared to the MTS-ViT baseline, while SMTS-ViViT achieves a 10.33% improvement with SAR integration over the SMTS-ViT baseline. The proposed framework effectively enhances spectral reconstruction quality for robust agricultural monitoring.
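The tubelet extraction described in the abstract can be illustrated with a small plain-Python sketch. A 3D convolution whose stride equals its kernel size is equivalent to cutting the sequence into non-overlapping blocks, flattening each block, and applying one shared linear projection. The sketch below does exactly that. Only the temporal span $t=2$ comes from the abstract; the spatial patch size `p`, the toy dimensions, the embedding width `D`, and the averaging weights are assumptions chosen for illustration.

```python
from typing import List

Video = List[List[List[List[float]]]]  # indexed [time][row][col][channel]


def extract_tubelets(video: Video, t: int = 2, p: int = 2) -> List[List[float]]:
    """Split a [T][H][W][C] sequence into non-overlapping t x p x p tubelets.

    Each tubelet is flattened into one vector of length t * p * p * C.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % t or H % p or W % p:
        raise ValueError("T, H and W must be divisible by t, p and p")
    tubelets = []
    for t0 in range(0, T, t):          # step by t: no temporal overlap
        for r0 in range(0, H, p):      # step by p: no spatial overlap
            for c0 in range(0, W, p):
                flat: List[float] = []
                for dt in range(t):
                    for dr in range(p):
                        for dc in range(p):
                            flat.extend(video[t0 + dt][r0 + dr][c0 + dc])
                tubelets.append(flat)
    return tubelets


def project(tubelets: List[List[float]], weight: List[List[float]],
            bias: List[float]) -> List[List[float]]:
    """Shared linear projection of every flattened tubelet to D dimensions."""
    return [
        [sum(w * x for w, x in zip(row, tub)) + b for row, b in zip(weight, bias)]
        for tub in tubelets
    ]


# Toy sequence: 4 dates, 4 x 4 pixels, 3 channels; every value equals its date index.
T, H, W, C = 4, 4, 4, 3
video = [[[[float(ti)] * C for _ in range(W)] for _ in range(H)] for ti in range(T)]

tokens = extract_tubelets(video, t=2, p=2)

# Hypothetical projection: D = 4 output features, each the mean of the tubelet.
D = 4
dim = len(tokens[0])
weight = [[1.0 / dim] * dim for _ in range(D)]
bias = [0.0] * D
embedded = project(tokens, weight, bias)
```

With $t=2$, each token mixes only two adjacent acquisitions, so a sequence of `T` dates yields `T / 2` temporal token groups rather than a single embedding that aggregates the entire sequence.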

Paper Structure

This paper contains 24 sections, 15 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: The Study Area: Traill County, North Dakota, USA, 2020.
  • Figure 2: Multi-Temporal SAR and MSI Acquisition Scheme and Data Example. The black pixels in the MSI images represent the cloud mask.
  • Figure 3: The Video Vision Transformer Structure with the Temporal-Spatial Tubelet Embedding for Cloud-Robust MSI Reconstruction using MSI-SAR Fusion.
  • Figure 4: Overview of Multi-Modal Data Inputs and Cloud Masks. There are four rows: the first row shows target MSI reconstruction images, the second displays cloud masks (where black pixels indicate real cloud occlusion and red pixels indicate artificial cloud coverage), the third row contains input MSI data, and the fourth row presents input SAR data for the ViViT model. The x-axis indicates the timeline of data acquisition.
  • Figure 5: Cloud Removal and Reconstruction Results Across Models. Black pixels represent cloud masks.