Comparative analysis of dual-form networks for live land monitoring using multi-modal satellite image time series

Iris Dumeur, Jérémy Anger, Gabriele Facciolo

Abstract

Multi-modal Satellite Image Time Series (SITS) analysis faces significant computational challenges in live land monitoring applications. While Transformer architectures excel at capturing temporal dependencies and fusing multi-modal data, their quadratic computational complexity and the need to reprocess the entire sequence for each new acquisition limit their deployment for regular, large-area monitoring. This paper studies dual-form attention mechanisms for efficient multi-modal SITS analysis, which enable parallel training while supporting recurrent inference for incremental processing. We compare linear attention and retention mechanisms within a multi-modal spectro-temporal encoder. To address the SITS-specific challenges of temporal irregularity and misalignment, we develop temporal adaptations of dual-form mechanisms that compute token distances from actual acquisition dates rather than sequence indices. Our approach is evaluated on two tasks using Sentinel-1 and Sentinel-2 data: multi-modal SITS forecasting as a proxy task, and real-world solar panel construction monitoring. Experimental results show that dual-form mechanisms achieve performance comparable to standard Transformers while enabling efficient recurrent inference. The multi-modal framework consistently outperforms mono-modal approaches on both tasks, demonstrating the effectiveness of dual-form mechanisms for sensor fusion. These results open new opportunities for operational land monitoring systems that require regular updates over large geographic areas.
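
To make the dual-form idea concrete, the following is a minimal sketch of a retention-style mechanism whose decay depends on the gap in acquisition dates rather than on the index gap. It is not the authors' implementation: the scalar decay `gamma`, the helper names `parallel_form` and `RecurrentState`, the toy dates, and the omission of normalization, rotation, and multi-head structure are all illustrative assumptions. The sketch only checks that the parallel form (one pass over the full sequence, as used in training) and the recurrent form (one fixed-size state update per new acquisition, as used in live inference) give the same outputs.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def parallel_form(q, k, v, days, gamma):
    """Parallel form: each output is a decayed sum over all past tokens."""
    outputs = []
    for n in range(len(q)):
        out = [0.0] * len(v[0])
        for m in range(n + 1):  # causal: current and past acquisitions only
            # Assumption: decay uses the gap in days, not the index gap n - m.
            weight = gamma ** (days[n] - days[m]) * dot(q[n], k[m])
            out = [o + weight * x for o, x in zip(out, v[m])]
        outputs.append(out)
    return outputs


class RecurrentState:
    """Recurrent form: a d_k x d_v state updated once per acquisition."""

    def __init__(self, d_k, d_v, gamma):
        self.gamma = gamma
        self.state = [[0.0] * d_v for _ in range(d_k)]
        self.last_day = None

    def step(self, q_n, k_n, v_n, day):
        gap = 0 if self.last_day is None else day - self.last_day
        decay = self.gamma ** gap
        # state <- decay * state + outer(k_n, v_n)
        self.state = [
            [decay * s + k_i * v_j for s, v_j in zip(row, v_n)]
            for row, k_i in zip(self.state, k_n)
        ]
        self.last_day = day
        # output <- q_n^T state
        return [
            sum(q_i * row[j] for q_i, row in zip(q_n, self.state))
            for j in range(len(v_n))
        ]


rng = random.Random(0)
days = [0, 5, 10, 22, 25, 40]  # hypothetical irregular acquisition dates
d_k, d_v, gamma = 4, 3, 0.97   # hypothetical sizes and decay rate
q = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in days]
k = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in days]
v = [[rng.gauss(0, 1) for _ in range(d_v)] for _ in days]

out_parallel = parallel_form(q, k, v, days, gamma)
cell = RecurrentState(d_k, d_v, gamma)
out_recurrent = [cell.step(*args) for args in zip(q, k, v, days)]

max_gap = max(
    abs(a - b)
    for row_a, row_b in zip(out_parallel, out_recurrent)
    for a, b in zip(row_a, row_b)
)
print(f"max abs difference between the two forms: {max_gap:.2e}")
```

In the recurrent form, processing a new acquisition costs one state update that is independent of the sequence length, which is the property that makes incremental monitoring attractive compared with reprocessing the whole series.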

Paper Structure (28 sections, 18 equations, 6 figures, 2 tables)


Figures (6)

  • Figure 1: Overview of the proposed multi-modal spectro-spatio-temporal architecture, composed of a modality-specific spectro-spatial encoder (denoted $g^{S1}$ and $g^{S2}$, respectively), a temporal fusion encoder, an upsampling operation with pixel shuffle (P.S), and a task-specific decoder layer.
  • Figure 2: Multi-modal forecasting framework.
  • Figure 3: Visualization of the reconstruction in the multi-modal forecasting task using the Time CosFormer mechanism. The top row shows the input and target SITS. The first $n-1$ images are used by the model to predict the $n^{\text{th}}$ one. The bottom row shows the reconstruction by our model.
  • Figure 4: Boxplots of the MSE on the forecasting task for various temporal fusion encoders. The top row shows results obtained in the mono-modal S2 forecasting framework; the bottom row shows the multi-modal reconstruction loss. In each boxplot, a line marks the median value.
  • Figure 5: Effect of the look-back length in the forecasting task for various dual-form mechanisms. Mono-modal results are displayed in the top row and multi-modal results in the bottom row.
  • ...and 1 more figure