Exact: Exploring Space-Time Perceptive Clues for Weakly Supervised Satellite Image Time Series Semantic Segmentation

Hao Zhu, Yan Zhu, Jiayu Xiao, Tianxiang Xiao, Yike Ma, Yucheng Zhang, Feng Dai

TL;DR

The paper tackles weakly supervised semantic segmentation for Satellite Image Time Series (SITS) crop mapping by identifying spatial noise and temporal bias as key obstacles. It proposes Exact, a space-time perceptive clues framework that builds clue-based CAMs (CB-CAMs) through spatial prototypes and temporal-to-class attention, coupled with a clue-based contrastive loss and temporal-aware affinity propagation to suppress erroneous regions. A TSViT-based temporal-spatial backbone processes $\mathbf{X}\in\mathbb{R}^{T\times C\times H\times W}$ into dense embeddings, from which CB-CAMs generate high-quality pseudo labels for segmentation. On the PASTIS and Germany benchmarks, Exact achieves up to 95% of fully supervised performance, demonstrating strong potential for annotation-efficient crop mapping in practice.

Abstract

Automated crop mapping through Satellite Image Time Series (SITS) has emerged as a crucial avenue for agricultural monitoring and management. However, due to the low resolution and unclear parcel boundaries, annotating pixel-level masks in SITS is exceptionally complex and time-consuming. This paper adopts the weakly supervised paradigm (i.e., only image-level categories are available) to free the crop mapping task from the exhaustive annotation burden. The unique characteristics of SITS give rise to two challenges in weakly supervised learning: (1) noise perturbation from spatially neighboring regions, and (2) erroneous semantic bias from anomalous temporal periods. To address these difficulties, we propose a novel method, termed exploring space-time perceptive clues (Exact). First, we introduce a set of spatial clues to explicitly capture the representative patterns of different crops from the most class-relevant regions. In addition, we leverage the temporal-to-class interaction of the model to emphasize the contributions of pivotal clips, thereby enhancing the model's perception of crop regions. Building upon the space-time perceptive clues, we derive the clue-based CAMs to effectively supervise the SITS segmentation network. Our method demonstrates impressive performance on various SITS benchmarks. Remarkably, the segmentation network trained on Exact-generated masks achieves 95% of its fully supervised performance, showing the promise of the weakly supervised paradigm in crop mapping scenarios. Our code will be publicly available.
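To make the idea of spatial clues more concrete, the following is a minimal NumPy sketch of how a class activation map could be derived from class-specific prototypes by cosine similarity. It is an illustration under stated assumptions, not the authors' implementation: the function name `prototype_cam`, the max over prototypes per class, and the temperature-scaled softmax across classes are all hypothetical choices. The symbols $N_p$ (prototypes per class) and $\tau$ (similarity temperature) follow the hyper-parameters named in the Figure 5 caption.

```python
import numpy as np

def prototype_cam(embeddings, prototypes, tau=0.1):
    """Hypothetical prototype-similarity CAM (not the paper's exact formulation).

    embeddings: (H, W, D) dense pixel embeddings.
    prototypes: (K, N_p, D) N_p prototypes for each of K classes.
    Returns a (K, H, W) map whose values sum to 1 over the class axis.
    """
    h, w, d = embeddings.shape
    k, n_p, _ = prototypes.shape

    # L2-normalise so that the dot product equals cosine similarity.
    feats = embeddings.reshape(-1, d)
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    protos = prototypes.reshape(-1, d)
    protos = protos / (np.linalg.norm(protos, axis=1, keepdims=True) + 1e-8)

    # Similarity of every pixel to every prototype, then keep the
    # best-matching prototype within each class (an assumed reduction).
    sim = (feats @ protos.T).reshape(h * w, k, n_p).max(axis=2)

    # Temperature-scaled softmax across classes (numerically stabilised).
    logits = sim / tau
    logits = logits - logits.max(axis=1, keepdims=True)
    prob = np.exp(logits)
    prob = prob / prob.sum(axis=1, keepdims=True)
    return prob.T.reshape(k, h, w)

rng = np.random.default_rng(0)
prototypes = rng.normal(size=(3, 2, 8))   # K=3 classes, N_p=2, D=8
embeddings = rng.normal(size=(4, 5, 8))   # H=4, W=5
embeddings[0, 0] = prototypes[2, 1]       # plant a pixel that matches class 2
cam = prototype_cam(embeddings, prototypes)
pseudo_label = cam.argmax(axis=0)         # (H, W) hard pseudo labels
```

In this sketch a smaller `tau` sharpens the per-pixel class distribution, which is one plausible reason a similarity temperature would be treated as a tunable hyper-parameter.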

Paper Structure

This paper contains 26 sections, 18 equations, 11 figures, 10 tables, 1 algorithm.

Figures (11)

  • Figure 1: Illustration of the two inherent issues arising from the spatial and temporal perspectives in SITS. (a) shows noise perturbation from the spatial perspective. We visualize the high-level feature manifold of Dog (natural image) and Barley (SITS) to reveal the distinct spatial properties. The feature dimensions are reduced by t-SNE. (b) shows the erroneous semantic bias induced by anomalous temporal clips. We denote the parcel regions with $\bigstar$.
  • Figure 2: (a) The training pipeline of Exact. We adopt the temporal-spatial scheme to handle the SITS input, which contains two transformer encoders. The temporal encoder first models interactions between acquisition times; the subsequent spatial encoder then discards the temporal dimension and models interactions between spatial positions. To overcome the difficulties arising from the spatial and temporal aspects, we propose two novel techniques in the temporal embedding space: (b) Explore Spatial Perceptive Clues to mitigate the noise perturbation and (c) Temporal-Aware Affinity Propagation to rectify the wrong semantic bias.
  • Figure 3: Visualization of the clue-based CAM generation. This process is performed after training the classification network.
  • Figure 4: Visualization of temporal feature spaces on the PASTIS train set. The feature dimensions are reduced by t-SNE.
  • Figure 5: Effect of the hyper-parameters. (a) the number of class-specific prototypes $N_p$. (b) the temperature of similarity $\tau$.
  • ...and 6 more figures
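
The temporal-then-spatial factorization described in the Figure 2 caption can be sketched in terms of tensor shapes as follows. This is a simplified illustration, not TSViT or the Exact pipeline: the attention uses identity projections with no learned weights, positional encodings and class tokens are omitted, and mean pooling over time is an assumed stand-in for however the real model discards the temporal dimension. The names `self_attention`, `temporal_then_spatial`, and `patch` are hypothetical.

```python
import numpy as np

def self_attention(tokens):
    """Single-head self-attention with identity projections.

    tokens: (..., L, D); attention runs over the L axis.
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)  # (..., L, L)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def temporal_then_spatial(x, patch=2):
    """x: (T, C, H, W) with H and W divisible by `patch`.

    Returns a (H/patch, W/patch, D) grid of embeddings, D = C * patch**2.
    """
    t, c, h, w = x.shape
    gh, gw = h // patch, w // patch

    # Cut every frame into non-overlapping patches and flatten each patch:
    # (T, C, gh, p, gw, p) -> (gh, gw, T, C, p, p) -> (N, T, D).
    x = x.reshape(t, c, gh, patch, gw, patch)
    tokens = x.transpose(2, 4, 0, 1, 3, 5).reshape(gh * gw, t, c * patch * patch)

    # Temporal stage: each patch location attends across its T acquisitions.
    temporal = self_attention(tokens)        # (N, T, D)

    # Drop the temporal axis (mean pooling is an assumption of this sketch).
    pooled = temporal.mean(axis=1)           # (N, D)

    # Spatial stage: patch locations attend to one another.
    spatial = self_attention(pooled)         # (N, D)
    return spatial.reshape(gh, gw, -1)

rng = np.random.default_rng(0)
series = rng.normal(size=(6, 4, 8, 8))       # T=6, C=4, H=W=8
dense = temporal_then_spatial(series, patch=2)
shuffled = temporal_then_spatial(series[rng.permutation(6)], patch=2)
```

Because this sketch has no temporal positional encoding, shuffling the acquisition order leaves the output unchanged; a real temporal encoder would typically inject acquisition-date information to break that symmetry.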