FoMo: Multi-Modal, Multi-Scale and Multi-Task Remote Sensing Foundation Models for Forest Monitoring

Nikolaos Ioannis Bountos; Arthur Ouaknine; Ioannis Papoutsis; David Rolnick

TL;DR

This work tackles scalable, global forest monitoring by introducing FoMo-Bench, a diverse, multi-modal benchmark, and FoMo-Net, a sensor-agnostic pretraining framework for remote sensing (RS) foundation models. FoMo-Bench combines 15 datasets across sensors and modalities with tasks spanning classification, segmentation, and detection, and adds the TalloS dataset for extensive tree-genus classification. FoMo-Net trains a single backbone to handle varied modalities and spectral bands via random band sampling and MAE-based reconstruction, achieving strong cross-task performance and demonstrating the feasibility of a unified foundation model for forest monitoring. The study highlights opportunities for larger encoders, modality-aware evaluation, and extensions to non-grid data, laying groundwork for scalable, multi-task, multi-modal RS models in ecological monitoring.

Abstract

Forests are vital to ecosystems, supporting biodiversity and essential services, but are rapidly changing due to land use and climate change. Understanding and mitigating negative effects requires parsing data on forests at global scale from a broad array of sensory modalities, and using them in diverse forest monitoring applications. Such diversity in data and applications can be effectively addressed through the development of a large, pre-trained foundation model that serves as a versatile base for various downstream tasks. However, remote sensing modalities, which are an excellent fit for several forest management tasks, are particularly challenging considering the variation in environmental conditions, object scales, image acquisition modes, spatio-temporal resolutions, etc. With that in mind, we present the first unified Forest Monitoring Benchmark (FoMo-Bench), carefully constructed to evaluate foundation models with such flexibility. FoMo-Bench consists of 15 diverse datasets encompassing satellite, aerial, and inventory data, covering a variety of geographical regions, and including multispectral, red-green-blue, synthetic aperture radar and LiDAR data with various temporal, spatial and spectral resolutions. FoMo-Bench includes multiple types of forest-monitoring tasks, spanning classification, segmentation, and object detection. To enhance task and geographic diversity in FoMo-Bench, we introduce TalloS, a global dataset combining satellite imagery with ground-based annotations for tree species classification across 1,000+ categories and hierarchical taxonomic levels. Finally, we propose FoMo-Net, a pre-training framework to develop foundation models with the capacity to process any combination of commonly used modalities and spectral bands in remote sensing.

Paper Structure (8 sections, 1 equation, 3 figures, 4 tables)

Introduction
Related Work
FoMo-Bench
The TalloS Dataset
FoMo-Net
Experiments
Discussion
Conclusion

Figures (3)

Figure 1: FoMo-Bench evaluation framework and the FoMo-Net pretraining framework for foundation models.
Figure 2: FoMo-Bench spatial distribution. (a) Spatial distribution of the datasets included in FoMo-Bench, without TalloS; satellite-based datasets are marked with arrows, aerial-based datasets with circles. (b) Distribution of the train, validation and test splits of the TalloS dataset worldwide.
Figure 3: FoMo-Net pre-training framework. Considering a set of potential spectral input bands and a sub-training process of $V$ batches, each batch includes a subset of bands determined by the sampled dataset and the modalities it contains. For each batch, the input band embeddings, the spectral embeddings and the positional embeddings are summed element-wise to generate individual embeddings per band. The masked autoencoder, $f_{\theta}(\cdot)$ parameterized by $\theta$, reconstructs the partially masked inputs of a given batch, illustrated with dashed squares. The gradients of the loss $\mathscr{L}(\cdot)$ are accumulated and backpropagated through both $f_{\theta}(\cdot)$ and the linear projections. For the sake of clarity, the elevation modality is omitted from the figure: it is not considered a spectral band, but it is included in the pre-training scheme.
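
The token construction described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation. The band names, the embedding size, the uniform sampling of a band subset and the 0.75 mask ratio are all assumptions made for illustration, and the function names are hypothetical. The per-band linear projections are replaced by random stand-in vectors supplied by the caller.

```python
import random

# Hypothetical band names; the real set depends on the datasets used.
BANDS = ["B02", "B03", "B04", "B08", "VV", "VH"]


def make_table(n_rows, dim, rng):
    """Random stand-in for a learned embedding table (n_rows x dim)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_rows)]


def sample_bands(available, rng):
    """Pick a random non-empty subset of bands, kept in their original order."""
    k = rng.randint(1, len(available))
    return sorted(rng.sample(available, k), key=available.index)


def embed_tokens(patch_embeddings, band_ids, spectral_table, positional_table):
    """Sum band, spectral and positional embeddings element-wise.

    patch_embeddings[b][p] is the (already projected) vector of patch p
    for the b-th sampled band; band_ids[b] indexes the spectral table.
    Returns one token per (band, patch) pair.
    """
    tokens = []
    for b, band_id in enumerate(band_ids):
        for p, vec in enumerate(patch_embeddings[b]):
            spec = spectral_table[band_id]
            pos = positional_table[p]
            tokens.append([x + s + q for x, s, q in zip(vec, spec, pos)])
    return tokens


def random_mask(n_tokens, mask_ratio, rng):
    """Boolean mask with int(n_tokens * mask_ratio) tokens hidden (True)."""
    n_masked = int(n_tokens * mask_ratio)
    masked = set(rng.sample(range(n_tokens), n_masked))
    return [i in masked for i in range(n_tokens)]


def build_batch_tokens(n_patches, dim, rng, mask_ratio=0.75):
    """One illustrative step: sample bands, embed tokens, draw a mask."""
    bands = sample_bands(BANDS, rng)
    band_ids = [BANDS.index(name) for name in bands]
    spectral_table = make_table(len(BANDS), dim, rng)
    positional_table = make_table(n_patches, dim, rng)
    # Stand-in for the per-band linear projection of image patches.
    patch_embeddings = [make_table(n_patches, dim, rng) for _ in bands]
    tokens = embed_tokens(patch_embeddings, band_ids, spectral_table, positional_table)
    mask = random_mask(len(tokens), mask_ratio, rng)
    return bands, tokens, mask
```

Calling `build_batch_tokens(n_patches=4, dim=8, rng=random.Random(0))` returns the sampled band names, one token per band and patch, and the mask marking the tokens an autoencoder would have to reconstruct. In an actual training loop the tables would be learned parameters, and the reconstruction loss would be accumulated over several such batches before a backward pass, as the caption describes.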
