Exact: Exploring Space-Time Perceptive Clues for Weakly Supervised Satellite Image Time Series Semantic Segmentation
Hao Zhu, Yan Zhu, Jiayu Xiao, Tianxiang Xiao, Yike Ma, Yucheng Zhang, Feng Dai
TL;DR
The paper tackles weakly supervised semantic segmentation for Satellite Image Time Series (SITS) crop mapping, identifying spatial noise and temporal bias as the key obstacles. It proposes Exact, a space-time perceptive clues framework that builds clue-based CAMs (CB-CAMs) from spatial prototypes and temporal-to-class attention, coupled with a clue-based contrastive loss and temporal-aware affinity propagation to suppress erroneous regions. A TSViT-based temporal-spatial backbone processes $\mathbf{X}\in\mathbb{R}^{T\times C\times H\times W}$ into dense embeddings, from which CB-CAMs generate high-quality pseudo labels for segmentation. On the PASTIS and Germany benchmarks, a segmentation network trained on Exact-generated pseudo labels reaches up to 95% of its fully supervised performance, demonstrating strong potential for annotation-efficient crop mapping in practice.
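To make the CAM-to-pseudo-label step concrete, here is a minimal NumPy sketch of the generic procedure used in CAM-based weak supervision: normalize the per-class activation maps, suppress classes absent from the image-level label, and threshold to separate background. It is not the authors' CB-CAM construction (no spatial clues, temporal-to-class attention, or affinity propagation). The function name `cams_to_pseudo_labels`, the `bg_threshold` value, and the label encoding are assumptions chosen for illustration.

```python
import numpy as np


def cams_to_pseudo_labels(cams, image_labels, bg_threshold=0.3):
    """Turn class activation maps into a pixel-level pseudo-label mask.

    cams:         (K, H, W) raw activation maps, one per crop class
    image_labels: (K,) binary vector of image-level labels
    bg_threshold: hypothetical cutoff; pixels scoring below it become background
    Returns an (H, W) integer mask where 0 = background and k + 1 = class k.
    """
    cams = np.maximum(cams, 0.0)  # keep positive evidence only
    # Normalize each class map to [0, 1] by its own peak value.
    peak = cams.reshape(cams.shape[0], -1).max(axis=1)
    cams = cams / np.maximum(peak, 1e-8)[:, None, None]
    # Zero out classes that the image-level label says are absent.
    cams = cams * image_labels[:, None, None]
    best = cams.argmax(axis=0)  # most activated class per pixel
    score = cams.max(axis=0)    # its normalized activation
    return np.where(score >= bg_threshold, best + 1, 0)


# Toy example: K = 2 classes on a 2 x 2 grid, only class 0 present.
toy_cams = np.array([[[1.0, 0.1], [0.0, 0.2]],
                     [[0.5, 2.0], [0.0, 4.0]]])
mask = cams_to_pseudo_labels(toy_cams, np.array([1, 0]))
```

In this toy case the second class is strongly activated but is discarded because the image-level label excludes it, which is the basic mechanism by which image-level supervision constrains the pseudo labels.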
Abstract
Automated crop mapping through Satellite Image Time Series (SITS) has emerged as a crucial avenue for agricultural monitoring and management. However, due to the low resolution and unclear parcel boundaries, annotating pixel-level masks in SITS is exceptionally complex and time-consuming. This paper adopts the weakly supervised paradigm (i.e., only image-level categories are available) to free the crop mapping task from the exhaustive annotation burden. The unique characteristics of SITS give rise to several challenges in weakly supervised learning: (1) noise perturbation from spatially neighboring regions, and (2) erroneous semantic bias from anomalous temporal periods. To address these difficulties, we propose a novel method, termed exploring space-time perceptive clues (Exact). First, we introduce a set of spatial clues to explicitly capture the representative patterns of different crops from the most class-relevant regions. In addition, we leverage the temporal-to-class interaction of the model to emphasize the contributions of pivotal clips, thereby enhancing the model's perception of crop regions. Building upon the space-time perceptive clues, we derive clue-based CAMs to effectively supervise the SITS segmentation network. Our method demonstrates impressive performance on various SITS benchmarks. Remarkably, the segmentation network trained on Exact-generated masks achieves 95% of its fully supervised performance, showing the bright promise of the weakly supervised paradigm in crop mapping scenarios. Our code will be publicly available.
