One for All: Toward Unified Foundation Models for Earth Vision

Zhitong Xiong, Yi Wang, Fahong Zhang, Xiao Xiang Zhu

TL;DR

Remote sensing foundation models are typically tied to single modalities or resolutions, limiting cross-dataset applicability. The authors propose OFA-Net, a unified foundation model that uses a single shared Transformer backbone with modality-specific patch embeddings and masked image modeling to learn from a curated multi-modal Earth-observation dataset. Evaluations on 12 GEO-Bench tasks show OFA-Net surpasses single-modality pretraining and random initialization on both classification and segmentation tasks, demonstrating improved generalization across modalities and resolutions. This work advances toward a truly unified Earth-vision backbone, potentially simplifying deployment and expanding cross-modality analysis.

Abstract

Foundation models characterized by extensive parameters and trained on large-scale datasets have demonstrated remarkable efficacy across various downstream tasks for remote sensing data. Current remote sensing foundation models typically specialize in a single modality or a specific spatial resolution range, limiting their versatility for downstream datasets. While there have been attempts to develop multi-modal remote sensing foundation models, they typically employ separate vision encoders for each modality or spatial resolution, necessitating a switch in backbones contingent upon the input data. To address this issue, we introduce a simple yet effective method, termed OFA-Net (One-For-All Network): employing a single, shared Transformer backbone for multiple data modalities with different spatial resolutions. Using the masked image modeling mechanism, we pre-train a single Transformer backbone on a curated multi-modal dataset with this simple design. Then the backbone model can be used in different downstream tasks, thus forging a path towards a unified foundation backbone model in Earth vision. The proposed method is evaluated on 12 distinct downstream tasks and demonstrates promising performance.
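The core design described above (modality-specific patch embeddings feeding one shared backbone, pre-trained with masked image modeling) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The modality names, channel counts, patch size, embedding width, and the 0.75 mask ratio are all assumptions chosen for illustration, and a single dense layer stands in for the shared Transformer blocks.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 64  # shared token width (assumed value)
PATCH = 16      # patch size in pixels (assumed value)

# Hypothetical modalities and channel counts, for illustration only.
MODALITY_CHANNELS = {"sar": 2, "multispectral": 9, "rgb": 3}

# One linear patch embedding per modality: (C * PATCH * PATCH) -> EMBED_DIM.
patch_embed = {
    name: rng.normal(0.0, 0.02, size=(c * PATCH * PATCH, EMBED_DIM))
    for name, c in MODALITY_CHANNELS.items()
}

# Weights shared by every modality. A real backbone would be a stack of
# Transformer blocks; one dense layer is used here as a placeholder.
shared_w = rng.normal(0.0, 0.02, size=(EMBED_DIM, EMBED_DIM))


def patchify(image, patch=PATCH):
    """Split a (C, H, W) image into flattened, non-overlapping patches."""
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be multiples of patch"
    gh, gw = h // patch, w // patch
    x = image.reshape(c, gh, patch, gw, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # (gh, gw, C, patch, patch)
    return x.reshape(gh * gw, c * patch * patch)


def random_mask(num_tokens, mask_ratio, rng):
    """Return the sorted indices of the tokens that stay visible."""
    num_keep = int(num_tokens * (1.0 - mask_ratio))
    return np.sort(rng.permutation(num_tokens)[:num_keep])


def encode(image, modality, mask_ratio=0.75):
    """Embed with the modality's own projection, mask, then apply shared weights."""
    tokens = patchify(image) @ patch_embed[modality]  # (N, EMBED_DIM)
    keep = random_mask(tokens.shape[0], mask_ratio, rng)
    visible = tokens[keep]                            # only visible tokens are encoded
    return np.tanh(visible @ shared_w), keep


# Inputs with different channel counts and sizes pass through the same weights.
sar = rng.normal(size=(2, 64, 64))
rgb = rng.normal(size=(3, 128, 128))
z_sar, keep_sar = encode(sar, "sar")
z_rgb, keep_rgb = encode(rgb, "rgb")
```

Only the patch-embedding matrices differ per modality; everything after them is shared, so switching input data does not require switching backbones. In a full masked-image-modeling setup, a decoder would then reconstruct the masked patches from the encoded visible tokens, which this sketch omits.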

Paper Structure (9 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Illustration of the proposed method. Our model is designed to handle input data from a range of modalities, and varying spatial resolutions, such as 30 meters and 1 meter, using a singular, unified framework. This integrative approach allows for the simultaneous processing of all modalities within one comprehensive model.
  • Figure 2: Illustration of existing and the proposed foundation models for multi-modal data.
  • Figure 3: Detailed information of the five sub-datasets in the curated multi-modal dataset.
  • Figure 4: Workflow of the proposed unified foundation model for multiple data modalities.