A Joint Learning Framework with Feature Reconstruction and Prediction for Incomplete Satellite Image Time Series in Agricultural Semantic Segmentation

Yuze Wang, Mariana Belgiu, Haiyang Wu, Dandan Zhong, Yangyang Cao, Chao Tao

TL;DR

Cloud-induced gaps in Satellite Image Time Series (SITS) disrupt temporal dependencies and hinder agricultural semantic segmentation. The paper presents a joint learning framework that combines selective feature reconstruction with prediction. Missing data are simulated during training with temporal masks, and both tasks are guided by a teacher model trained on complete SITS. By aligning reconstructed features with the teacher's and distilling teacher knowledge into the predictor, the method reduces noise, mitigates shortcut learning, and preserves long-term temporal dynamics, improving cropland extraction and crop classification. Results on the Hunan, Western France, and Catalonia datasets, using Sentinel-2 and PlanetScope imagery, show that the approach generalizes across sensors and model backbones, which makes it applicable to real-world agricultural monitoring.

Abstract

Satellite Image Time Series (SITS) is crucial for agricultural semantic segmentation. However, cloud contamination introduces temporal gaps in SITS, disrupting temporal dependencies and causing feature shifts, which degrades the performance of models trained on complete SITS. Existing methods typically address this by reconstructing the entire SITS before prediction or by using data augmentation to simulate missing data. Yet full reconstruction may introduce noise and redundancy, while a data-augmented model can only handle a limited set of missing patterns, leading to poor generalization. We propose a joint learning framework with feature reconstruction and prediction to address incomplete SITS more effectively. During training, we simulate data-missing scenarios using temporal masks. The two tasks are guided by both ground-truth labels and a teacher model trained on complete SITS. The prediction task constrains the model to selectively reconstruct, from masked inputs, the critical features that align with the teacher's temporal feature representations. This reduces unnecessary reconstruction and limits noise propagation. By integrating reconstructed features into the prediction task, the model avoids learning shortcuts and retains its ability to handle both varied missing patterns and complete SITS. Experiments on SITS from Hunan Province, Western France, and Catalonia show that our method improves mean F1-scores by 6.93% in cropland extraction and 7.09% in crop classification over baselines. It also generalizes well across satellite sensors, including Sentinel-2 and PlanetScope, under varying temporal missing rates and model backbones.
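The abstract states that data-missing scenarios are simulated with temporal masks during training but does not give the sampling scheme here. The following is a minimal sketch of one plausible scheme, assuming uniformly random selection of missing acquisition dates and zero-filling of the masked time steps. The function names, the `missing_rate` parameter, and the `fill_value` choice are illustrative assumptions, not the authors' implementation.

```python
import random


def sample_temporal_mask(num_steps, missing_rate, rng):
    """Return a list of flags, one per time step: 1 = kept, 0 = simulated as missing.

    Assumption: missing dates are drawn uniformly at random; the paper may
    use a different (e.g. cloud-statistics-driven) scheme.
    """
    if not 0.0 <= missing_rate < 1.0:
        raise ValueError("missing_rate must be in [0, 1)")
    num_missing = int(round(num_steps * missing_rate))
    # Always keep at least one observation so the sequence is never empty.
    num_missing = min(num_missing, num_steps - 1)
    missing = set(rng.sample(range(num_steps), num_missing))
    return [0 if t in missing else 1 for t in range(num_steps)]


def apply_temporal_mask(series, mask, fill_value=0.0):
    """Replace every masked time step of `series` with `fill_value`.

    `series` is a list of per-date feature vectors (a stand-in for image frames).
    """
    if len(series) != len(mask):
        raise ValueError("series and mask must have the same length")
    return [
        list(frame) if keep else [fill_value] * len(frame)
        for frame, keep in zip(series, mask)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 12-date series with 2 features per date.
    series = [[float(t), float(t) + 0.5] for t in range(12)]
    mask = sample_temporal_mask(len(series), missing_rate=0.5, rng=rng)
    masked = apply_temporal_mask(series, mask)
    print(mask)
    print(masked)
```

Drawing a fresh mask for each training sample exposes the model to many missing patterns rather than a fixed few, which is the kind of variety the abstract contrasts with conventional data augmentation.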

Paper Structure

This paper contains 24 sections, 6 equations, 9 figures, 19 tables.

Figures (9)

  • Figure 1: The location and extent of datasets, where red and blue regions represent complete and incomplete subsets, respectively.
  • Figure 2: Monthly data availability of SITS samples in the incomplete subsets of the Hunan SEN, Hunan PLA, and Fr&Cat S4A datasets.
  • Figure 3: The general framework of the proposed method.
  • Figure 4: A toy case in which two students, $S_1$ and $S_2$, are guided by the same teacher $T$. Before standardization (Z-score), the features that $S_1$ reconstructs from incomplete SITS are more similar to the teacher's ideal features, yet $S_1$ achieves a lower F1-score. In contrast, the features from $S_2$ are less similar to the ideal features and yield a higher Mean Squared Error distance $L_{MSE}$, yet $S_2$ achieves a higher F1-score. Standardization resolves this mismatch by making the model focus on the feature distribution rather than on absolute values and boundaries.
  • Figure 5: Visualization of sample results for our and compared methods in the Hunan SEN dataset: (a)-(c) from simulated experiments, (d)-(f) from real-world experiments.
  • ...and 4 more figures
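
The Figure 4 caption contrasts a raw Mean Squared Error distance with one computed after Z-score standardization. The sketch below illustrates that idea on made-up numbers: a student whose features have the same shape as the teacher's but a different scale gets a large raw MSE and a near-zero standardized MSE. The helper names and the toy values are assumptions for illustration only; the paper's exact loss formulation is not reproduced here.

```python
import math


def zscore(values, eps=1e-8):
    """Standardize a feature vector to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var)
    # eps guards against division by zero for constant vectors.
    return [(v - mean) / (std + eps) for v in values]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def standardized_mse(student, teacher):
    """MSE computed after Z-scoring both vectors (distribution-level distance)."""
    return mse(zscore(student), zscore(teacher))


if __name__ == "__main__":
    # Hypothetical feature vectors, not taken from the paper.
    teacher = [1.0, 2.0, 3.0, 4.0]
    student = [10.0, 20.0, 30.0, 40.0]  # same shape as the teacher, 10x the scale
    print("raw MSE:", mse(student, teacher))
    print("standardized MSE:", standardized_mse(student, teacher))
```

Because standardization removes offset and scale, the distance only penalizes differences in how feature values are distributed, which matches the caption's point about focusing on distribution rather than on values and boundaries.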