
EoS-FM: Can an Ensemble of Specialist Models act as a Generalist Feature Extractor?

Pierre Adorni, Minh-Tan Pham, Stéphane May, Sébastien Lefèvre

TL;DR

The paper tackles the resource-heavy paradigm of Earth Observation foundation models by proposing EoS-FM, an Ensemble-of-Specialists framework that aggregates multiple lightweight, task-specific ConvNeXtV2-Atto encoders. The encoders stay frozen during downstream tasks, and a differentiable selection layer followed by a 1x1 fusion layer combines their outputs into compact representations. This yields strong performance across 11 remote sensing (RS) tasks with significantly fewer parameters. The method remains robust under label scarcity, can be pruned with a top-k selection mechanism to produce compact variants, and naturally supports federated training. The work emphasizes modularity, efficiency, and sustainability, offering a practical, open-source path toward general-purpose remote sensing foundation models (RSFMs).
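To make the selection and pruning steps concrete, here is a minimal NumPy sketch under our own assumptions, not the authors' implementation. We assume the differentiable selection layer is a softmax gate over one learnable score per encoder (`soft_selection`), and that top-k pruning simply keeps the k encoders with the highest scores (`prune_top_k`). All function names, shapes, and score values are hypothetical.

```python
import numpy as np


def softmax(scores: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array of per-encoder scores."""
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / e.sum()


def soft_selection(feats: np.ndarray, logits: np.ndarray) -> np.ndarray:
    """Scale each frozen encoder's features by a softmax gate (assumed form).

    feats:  (E, C, H, W) features from E frozen encoders (same shape assumed).
    logits: (E,) learnable selection scores; in an autodiff framework the
            gradient would reach only `logits`, since the encoders are frozen.
    """
    weights = softmax(logits)
    return feats * weights[:, None, None, None]


def prune_top_k(logits: np.ndarray, k: int) -> np.ndarray:
    """Indices (ascending) of the k encoders with the highest selection scores."""
    if not 1 <= k <= logits.shape[0]:
        raise ValueError("k must lie between 1 and the number of encoders")
    top = np.argpartition(-logits, k - 1)[:k]
    return np.sort(top)


# Toy example: 5 hypothetical specialists producing 4-channel 8x8 feature maps.
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 4, 8, 8))
logits = np.array([0.1, 2.0, -1.0, 1.5, 0.3])
gated = soft_selection(feats, logits)  # same shape as feats
keep = prune_top_k(logits, k=2)        # the 2 highest-scoring encoders
compact = feats[keep]                  # only these would be run at inference
```

With these scores, `keep` is `[1, 3]`. Because pruning drops whole encoders, the compact variant saves both parameters and compute at inference, not just fusion weights.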

Abstract

Recent advances in foundation models have shown great promise in domains such as natural language processing and computer vision, and similar efforts are now emerging in the Earth Observation community. These models aim to generalize across tasks with limited supervision, reducing the need to train a separate model for each task. However, current strategies, which largely focus on scaling model size and dataset volume, require prohibitive computational and data resources, limiting accessibility to a few large institutions. Moreover, this paradigm of ever-larger models stands in stark contrast to the principles of sustainable and environmentally responsible AI, as it leads to immense carbon footprints and resource inefficiency. In this work, we present a novel and efficient alternative: an Ensemble-of-Specialists framework for building Remote Sensing Foundation Models (RSFMs). Our method decomposes the training process into lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused. This modular approach offers strong advantages in efficiency, interpretability, and extensibility. Moreover, it naturally supports federated training, pruning, and continuous specialist integration, making it particularly well-suited for collaborative and resource-constrained settings. Our framework sets a new direction for building scalable and efficient RSFMs. All code and pretrained models are available at https://github.com/pierreadorni/EoS-FM.

Paper Structure

This paper contains 22 sections, 2 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The proposed EoS-FM demonstrates strong and consistent performance across 11 remote sensing tasks, emerging as the most balanced foundation model among those evaluated on the Pangaea Benchmark [marsocciPANGAEAGlobalInclusive2024], despite having fewer parameters. For each method, we show the number of parameters and the average DTB (Distance To Best) metric (lower is better), which is described in the section on downstream tasks.
  • Figure 2: The EoS-FM Backbone adapts any given input to a multitude of formats using band duplication and selection to extract as many feature maps as possible, and then fuse them. Each encoder produces $n$ feature maps; a subset of $k$ encoders is then selected for fusion, and their $k \times n$ feature maps are aggregated into $n$ fused feature maps before being passed to the decoder.
  • Figure 3: Variance of the feature maps computed by different encoders on the HLS Burn Scars training set. The variance differs considerably across encoders, which could cause problems when training an ensemble.
  • Figure 4: Ablation study: increasing the number of encoders improves the performance of the ensemble in the frozen setting. Experiments are performed on the HLS Burn Scars dataset, and the validation mIoU is reported.
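Figure 2 describes the backbone's data flow: each input is adapted to every encoder's expected band format by duplication or selection, and the k × n selected feature maps are then fused into n maps. The following is a minimal NumPy sketch of that flow under our own assumptions. The band-adaptation rule (`adapt_bands`), the per-channel standardization (`standardize`, one possible response to the variance mismatch shown in Figure 3, not necessarily the paper's), and the bias-free 1x1 fusion (`fuse_1x1`, `fuse_levels`) are hypothetical names and choices. Random arrays stand in for frozen encoder outputs.

```python
from typing import List

import numpy as np


def adapt_bands(x: np.ndarray, target_bands: int) -> np.ndarray:
    """Map a (bands, H, W) image to `target_bands` channels.

    Hypothetical rule: keep the first bands when there are too many, and
    repeat bands cyclically when there are too few (1 band -> 3 copies).
    """
    idx = np.arange(target_bands) % x.shape[0]
    return x[idx]


def standardize(f: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Per-channel zero mean / unit variance, so encoders whose features have
    very different variances contribute on a comparable scale (assumed remedy)."""
    mean = f.mean(axis=(1, 2), keepdims=True)
    std = f.std(axis=(1, 2), keepdims=True)
    return (f - mean) / (std + eps)


def fuse_1x1(maps: List[np.ndarray], weight: np.ndarray) -> np.ndarray:
    """Fuse the k maps of one level with a bias-free 1x1 convolution.

    maps:   k arrays of shape (C_i, H, W) sharing the same H and W.
    weight: (C_out, sum of C_i) kernel. A 1x1 conv is a per-pixel linear map
            over the concatenated channels, i.e. an einsum over that axis.
    """
    stacked = np.concatenate([standardize(m) for m in maps], axis=0)
    return np.einsum("oc,chw->ohw", weight, stacked)


def fuse_levels(per_encoder: List[List[np.ndarray]],
                weights: List[np.ndarray]) -> List[np.ndarray]:
    """k encoders x n levels of feature maps -> n fused maps."""
    n_levels = len(per_encoder[0])
    return [fuse_1x1([enc[lvl] for enc in per_encoder], weights[lvl])
            for lvl in range(n_levels)]


# Toy example: a 2-band input, k = 2 encoders expecting 3 bands, n = 2 levels.
rng = np.random.default_rng(0)
image = rng.normal(size=(2, 16, 16))
x3 = adapt_bands(image, 3)  # bands 0, 1, 0 of the input

# Random stand-ins for frozen encoder outputs at full and half resolution.
per_encoder = [[rng.normal(size=(4, 16, 16)), rng.normal(size=(8, 8, 8))]
               for _ in range(2)]
weights = [rng.normal(size=(4, 2 * 4)), rng.normal(size=(8, 2 * 8))]
fused = fuse_levels(per_encoder, weights)  # shapes (4, 16, 16) and (8, 8, 8)
```

A 1x1 convolution fits this role because it mixes channels pixel by pixel without changing spatial resolution, so each fused map keeps the shape that the corresponding decoder stage would expect.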