AgriFM: A Multi-source Temporal Remote Sensing Foundation Model for Agriculture Mapping

Wenyuan Li, Shunlin Liang, Keyan Chen, Yongzhe Chen, Han Ma, Jianglei Xu, Yichuan Ma, Shikang Guan, Husheng Fang, Zhenwei Shi

TL;DR

AgriFM introduces a multi-source temporal foundation model built on a synchronized Video Swin Transformer to jointly model hierarchical spatiotemporal patterns in agricultural remote sensing. It leverages MODIS, Landsat-8/9, and Sentinel-2 data, supervised by land-cover fractions from GLC_FCS30D and a mean-teacher scheme, to pretrain a versatile decoder for diverse crop mapping tasks. Across agricultural land mapping, field boundary delineation, land use/land cover mapping, paddy rice mapping, and winter wheat mapping, AgriFM consistently outperforms ViT-, CNN-, and Swin-based baselines, especially in low-data regimes and for fine-grained spatial outputs. The approach demonstrates strong data efficiency, cross-temporal and cross-source generalization, and practical relevance for large-scale, multi-resolution agricultural monitoring.
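The two supervision ingredients named above can be illustrated with a small sketch. It is not the authors' implementation. It assumes that "land-cover fractions" means the per-class share of pixels within a patch of a categorical land-cover map, and that the mean-teacher scheme follows the usual formulation in which teacher weights are an exponential moving average (EMA) of student weights. The function names, class codes, and momentum value are hypothetical.

```python
from collections import Counter


def land_cover_fractions(label_patch, class_ids):
    """Return the share of pixels per class in a 2-D categorical label patch.

    Assumption: fractions are simple pixel-count ratios over the patch.
    """
    flat = [value for row in label_patch for value in row]
    counts = Counter(flat)
    total = len(flat)
    return {c: counts.get(c, 0) / total for c in class_ids}


def ema_update(teacher, student, momentum=0.99):
    """Standard mean-teacher update: teacher <- m * teacher + (1 - m) * student.

    Weights are plain lists of floats here; the momentum value is illustrative.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


# Hypothetical 4x4 label patch; the class codes 10/20/30 are placeholders.
patch = [
    [10, 10, 20, 20],
    [10, 10, 20, 20],
    [10, 10, 10, 30],
    [10, 10, 10, 30],
]
fractions = land_cover_fractions(patch, class_ids=[10, 20, 30])
new_teacher = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
```

Fraction targets of this kind give a soft label that remains meaningful when one coarse pixel covers several land-cover classes, which is one plausible reason to prefer them over a single hard class per pixel.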

Abstract

Accurate crop mapping fundamentally relies on modeling multi-scale spatiotemporal patterns, where spatial scales range from individual field textures to landscape-level context, and temporal scales capture both short-term phenological transitions and full growing-season dynamics. Transformer-based remote sensing foundation models (RSFMs) offer promising potential for crop mapping due to their innate ability for unified spatiotemporal processing. However, current RSFMs remain suboptimal for crop mapping: they either employ fixed spatiotemporal windows that ignore the multi-scale nature of crop systems or completely disregard temporal information by focusing solely on spatial patterns. To bridge these gaps, we present AgriFM, a multi-source remote sensing foundation model specifically designed for agricultural crop mapping. Our approach begins by establishing the necessity of simultaneous hierarchical spatiotemporal feature extraction, leading to the development of a modified Video Swin Transformer architecture where temporal down-sampling is synchronized with spatial scaling operations. This modified backbone enables efficient unified processing of long time-series satellite inputs. AgriFM leverages temporally rich data streams from three satellite sources, MODIS, Landsat-8/9 and Sentinel-2, and is pre-trained on a globally representative dataset comprising over 25 million image samples supervised by land cover products. The resulting framework incorporates a versatile decoder architecture that dynamically fuses these learned spatiotemporal representations, supporting diverse downstream tasks. Comprehensive evaluations demonstrate AgriFM's superior performance over conventional deep learning approaches and state-of-the-art general-purpose RSFMs across all downstream tasks. Code will be available at https://github.com/flyakon/AgriFM.
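To make "temporal down-sampling synchronized with spatial scaling" concrete, the following sketch traces the token-grid size through a four-stage hierarchy. It is shape arithmetic only, not the authors' code. The patch size of (2, 4, 4) and the per-stage merge factor of 2 along every axis are assumptions; the abstract does not state the factors. The 224-pixel input and the 32-frame upper bound come from the Figure 3 caption.

```python
import math


def stage_shapes(t, h, w, num_stages=4, patch=(2, 4, 4), merge=(2, 2, 2)):
    """Return the (T, H, W) token grid after each stage.

    patch: assumed 3-D patch-embedding size (time, height, width).
    merge: assumed down-sampling factor applied between consecutive stages.
           merge=(2, 2, 2) shrinks time together with space (synchronized);
           merge=(1, 2, 2) shrinks space only, leaving time untouched.
    """
    shape = tuple(math.ceil(d / p) for d, p in zip((t, h, w), patch))
    shapes = [shape]
    for _ in range(num_stages - 1):
        shape = tuple(math.ceil(d / m) for d, m in zip(shape, merge))
        shapes.append(shape)
    return shapes


# 32 frames of 224 x 224 pixels, the largest input in the Figure 3 caption.
synchronized = stage_shapes(32, 224, 224)
spatial_only = stage_shapes(32, 224, 224, merge=(1, 2, 2))

for name, shapes in (("synchronized", synchronized), ("spatial-only", spatial_only)):
    print(name, shapes)
```

Under these assumed factors the final stage holds 2 x 7 x 7 tokens when time is merged along with space, against 16 x 7 x 7 when only space is merged. That eightfold reduction in deep-stage tokens is consistent with the abstract's claim of efficient processing of long time series.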

Paper Structure

This paper contains 27 sections, 11 equations, 14 figures, 12 tables.

Figures (14)

  • Figure 1: Flowchart outlining AgriFM: (a) the initial phase extracts pre-training supervision from geographical priors (land cover products) and assembles an extensive pre-training dataset; (b) the subsequent phase pre-trains the multi-source temporal foundation model and constructs the unified mapping framework.
  • Figure 2: The spatial distribution of pre-training samples collected on a global scale from Sentinel-2, Landsat-8/9 and MODIS.
  • Figure 3: Structure of the foundation model, AgriFM, comprising four stages. The input satellite sequences (MODIS, Landsat-8/9, and Sentinel-2) are characterized by specific dimensional parameters: $T$ denotes the temporal length of each sequence (randomly selected from 3 to 32 frames), while $W$ and $H$ represent the spatial width and height (both fixed at 224 pixels). The number of spectral bands, $C$, varies depending on the data source. The decoder upsamples and fuses features to yield mapping results, each marked by its respective label.
  • Figure 4: The study area and dataset detailed information for downstream mapping tasks.
  • Figure 5: Comparative performance on the agricultural land mapping and boundary delineation tasks. Bar plots show the F1-score for the positive class across models, grouped by architecture type (CNN variants, ViT-based, and Swin-based). Radar plots compare precision, recall, F1-score, and OA across models.
  • ...and 9 more figures