A Joint Learning Framework with Feature Reconstruction and Prediction for Incomplete Satellite Image Time Series in Agricultural Semantic Segmentation
Yuze Wang, Mariana Belgiu, Haiyang Wu, Dandan Zhong, Yangyang Cao, Chao Tao
TL;DR
Cloud-induced gaps in Satellite Image Time Series (SITS) disrupt temporal dependencies and hinder agricultural semantic segmentation. The paper presents a joint learning framework that combines selective feature reconstruction with prediction, guided by a teacher model trained on complete SITS, with missing data simulated during training via temporal masks. By aligning reconstructed features with the teacher's and distilling teacher knowledge into the predictor, the method reduces noise, mitigates shortcut learning, and preserves long-term temporal dynamics, improving cropland extraction and crop classification across sensors. Results on the Hunan, Western France, and Catalonia datasets using Sentinel-2 and PlanetScope imagery show strong cross-sensor generalization and backbone-agnostic performance, which makes the approach broadly applicable to real-world agricultural monitoring.
Abstract
Satellite Image Time Series (SITS) are crucial for agricultural semantic segmentation. However, cloud contamination introduces temporal gaps in SITS, disrupting temporal dependencies and causing feature shifts, which degrades the performance of models trained on complete SITS. Existing methods typically address this by reconstructing the entire SITS before prediction or by using data augmentation to simulate missing data. Yet full reconstruction may introduce noise and redundancy, while a data-augmented model can handle only a limited set of missing patterns, leading to poor generalization. We propose a joint learning framework with feature reconstruction and prediction to address incomplete SITS more effectively. During training, we simulate data-missing scenarios using temporal masks. The two tasks are guided by both ground-truth labels and a teacher model trained on complete SITS. The prediction task constrains the model to selectively reconstruct, from masked inputs, only the critical features that align with the teacher's temporal feature representations. This reduces unnecessary reconstruction and limits noise propagation. By integrating the reconstructed features into the prediction task, the model avoids learning shortcuts and retains its ability to handle both varied missing patterns and complete SITS. Experiments on SITS from Hunan Province, Western France, and Catalonia show that our method improves mean F1-scores by 6.93% in cropland extraction and 7.09% in crop classification over baselines. It also generalizes well across satellite sensors, including Sentinel-2 and PlanetScope, under varying temporal missing rates and model backbones.
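
The training setup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the masking scheme or the loss terms, so the zero-filling of masked acquisitions, the mean-squared-error feature alignment, the KL-divergence distillation term, the weights `w_rec` and `w_kd`, and all function names here are assumptions chosen for illustration.

```python
import math
import random


def sample_temporal_mask(num_steps, missing_rate, rng):
    """Return a list of 0/1 flags; 0 marks a time step treated as missing.

    Assumption: missing steps are drawn uniformly at random, and at least
    one acquisition is always kept.
    """
    num_missing = min(num_steps - 1, round(num_steps * missing_rate))
    missing = set(rng.sample(range(num_steps), num_missing))
    return [0 if t in missing else 1 for t in range(num_steps)]


def apply_mask(series, mask):
    """Zero out masked acquisitions in a [time][channel] series (assumed fill)."""
    return [[value * keep for value in step] for step, keep in zip(series, mask)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_loss(student_logits, student_feats, teacher_logits, teacher_feats,
               label, w_rec=1.0, w_kd=1.0):
    """Hypothetical joint objective for one pixel.

    ce  : supervised cross-entropy against the ground-truth label
    rec : MSE between student (masked input) and teacher (complete input) features
    kd  : KL(teacher || student) on the class distributions
    """
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    ce = -math.log(p_student[label])
    rec = sum((s - t) ** 2 for s, t in zip(student_feats, teacher_feats))
    rec /= len(student_feats)
    kd = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return ce + w_rec * rec + w_kd * kd


rng = random.Random(0)
mask = sample_temporal_mask(num_steps=10, missing_rate=0.3, rng=rng)
masked_series = apply_mask([[0.2, 0.5]] * 10, mask)
loss = joint_loss([2.0, 0.1], [0.4, 0.6], [2.5, -0.3], [0.5, 0.5], label=0)
```

In this sketch, the alignment and distillation terms vanish when the student's features and logits on the masked input match the teacher's on the complete input, leaving only the supervised term. That mirrors the stated goal of a model that behaves the same with or without gaps.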
