RS-DFM: A Remote Sensing Distributed Foundation Model for Diverse Downstream Tasks

Zhechao Wang, Peirui Cheng, Pengju Tian, Yuchao Wang, Mingxin Chen, Shujing Duan, Zhirui Wang, Xinming Li, Xian Sun

TL;DR

This work addresses the limitation of single-platform remote sensing foundation models by enabling online, multi-platform collaboration for diverse downstream tasks. It introduces RS-DFM, a distributed framework built from two modules. A Generalized BEV Generation (GBG) module uses relative depth estimation to map observations from multiple platforms into a unified bird's-eye-view (BEV) space, and a High-Low Frequency Decoupled Collaboration (HLFDC) module compresses features for bandwidth-efficient fusion. The approach is validated on the AirCo-MultiTasks dataset across 3D object detection, BEV instance segmentation, and trajectory prediction, demonstrating state-of-the-art performance and substantial transmission cost reductions. The results indicate that combining geometric priors with frequency-based information decoupling enables robust, scalable, cross-platform remote sensing perception suitable for multi-UAV deployments and real-time multi-task inference.

Abstract

Lightweight remote sensing foundation models have achieved notable success in online perception. However, each model performs online inference based solely on its own observations and its own model, and therefore lacks a comprehensive understanding of large-scale remote sensing scenarios. To overcome this limitation, we propose a Remote Sensing Distributed Foundation Model (RS-DFM) based on generalized information mapping and interaction. The model realizes online collaborative perception across multiple platforms and various downstream tasks by mapping observations into a unified space and applying a task-agnostic information interaction strategy. Specifically, we leverage the ground-based geometric prior of oblique remote sensing observations to change the feature mapping from absolute depth estimation to relative depth estimation, which enhances the model's ability to extract generalized features across diverse heights and perspectives. Additionally, we present a dual-branch information compression module that decouples high-frequency and low-frequency feature information, achieving feature-level compression while preserving essential task-agnostic details. To support this research, we create a multi-task simulation dataset named AirCo-MultiTasks for multi-UAV collaborative observation. We also conduct extensive experiments covering 3D object detection, instance segmentation, and trajectory prediction. The results demonstrate that RS-DFM achieves state-of-the-art performance across these downstream tasks.
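
The switch from absolute to relative depth can be illustrated with a little flat-ground geometry. The sketch below is not the paper's GBG formulation; it assumes a pinhole camera over a perfectly flat ground plane, and the function names (ground_depth, relative_depth) and all camera parameters are hypothetical. It shows why a ground prior helps across flight heights: the depth at which each image row meets the ground scales linearly with platform height, so a depth expressed relative to that ground depth does not change when the height does.

```python
import numpy as np

def ground_depth(height_m, pitch_rad, focal_px, cy_px, rows):
    """Depth along the optical axis at which each image row's ray hits flat ground.

    Assumes a pinhole camera at height_m above the ground, pitched down by
    pitch_rad from the horizontal. Rows at or above the horizon get np.inf.
    """
    rows = np.asarray(rows, dtype=float)
    alpha = np.arctan((rows - cy_px) / focal_px)  # ray angle below the optical axis
    elevation = pitch_rad + alpha                 # ray angle below the horizontal
    depth = np.full(rows.shape, np.inf)
    hits = elevation > 0                          # only downward rays reach the ground
    depth[hits] = height_m * np.cos(alpha[hits]) / np.sin(elevation[hits])
    return depth

def relative_depth(depth_m, ground_depth_m):
    """Depth expressed as a fraction of the ground depth on the same image row."""
    return depth_m / ground_depth_m

# Two hypothetical platforms with the same camera and pitch, at different heights.
rows = np.arange(0, 480, 60)
low = ground_depth(30.0, np.deg2rad(45.0), 500.0, 240.0, rows)
high = ground_depth(90.0, np.deg2rad(45.0), 500.0, 240.0, rows)

# Absolute ground depth differs by the height ratio on every row ...
height_ratio = high / low
# ... while the height-normalized depth profile is identical for both platforms.
profile_low = low / 30.0
profile_high = high / 90.0
```

Under these assumptions, a network that predicts relative_depth only has to learn how far a point sits in front of the ground along its ray, rather than a metric range that changes with every altitude.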

Paper Structure (34 sections, 9 equations, 8 figures, 5 tables)

This paper contains 34 sections, 9 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Illustration of collaborative perception among multiple remote sensing platforms. In this scene, several platforms observe the same area from different angles and heights. Through inter-platform collaboration, they achieve a more comprehensive understanding of the scene, enhancing the performance of various downstream tasks.
  • Figure 2: The overall architecture of our proposed RS-DFM framework. For clarity, only the collaboration between two platforms is shown.
  • Figure 3: Operation flow of the GBG module. This module introduces a generalized feature mapping through relative depth estimation, which enhances the accuracy of BEV representation.
  • Figure 4: Operation flow of the HLFDC module. This module compresses the original local features through two branches: low frequency and high frequency. In the high-frequency branch, local attention is applied to obtain high-frequency features reflecting objects' edges, shapes, and other details. These high-frequency features are then condensed into low-dimensional channels. Conversely, the low-frequency branch preserves the overall structural information through low-pass filtering by average pooling, followed by global attention. The low-frequency branch condenses spatial information into down-sampled representations.
  • Figure 5: Statistical charts for the AirCo-MultiTasks dataset, depicting the occlusion within a single view, the number of various object types, the distribution of distances between objects and drones, and the proportion of objects within observations, respectively.
  • ...and 3 more figures
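
The two-branch compression described in the Figure 4 caption can be sketched with plain array operations. This is a simplified stand-in, not the HLFDC module itself: average pooling plays the role of the low-frequency branch, the residual after pooling plays the role of the high-frequency branch, and channel selection by residual energy replaces the learned attention and channel condensation. The function names, the pooling factor k, and the number of kept channels are all assumptions made for illustration.

```python
import numpy as np

def avg_pool(x, k):
    """Non-overlapping k x k average pooling on a (C, H, W) array."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def upsample(x, k):
    """Nearest-neighbour upsampling by a factor of k along H and W."""
    return x.repeat(k, axis=1).repeat(k, axis=2)

def decouple(feat, k=4, keep_channels=4):
    """Split a feature map into a spatially reduced low-frequency part
    and a channel-reduced high-frequency part."""
    low = avg_pool(feat, k)                     # (C, H/k, W/k): coarse structure
    high = feat - upsample(low, k)              # (C, H, W): edges and fine detail
    energy = (high ** 2).sum(axis=(1, 2))       # detail energy per channel
    idx = np.sort(np.argsort(energy)[-keep_channels:])
    return low, high[idx], idx                  # only these three are transmitted

def reconstruct(low, high_kept, idx, k):
    """Receiver side: rebuild an approximate full-resolution feature map."""
    out = upsample(low, k)
    out[idx] += high_kept                       # restore detail on kept channels
    return out

rng = np.random.default_rng(0)
feat = rng.normal(size=(16, 32, 32))            # hypothetical BEV feature map
low, high_kept, idx = decouple(feat, k=4, keep_channels=4)
recon = reconstruct(low, high_kept, idx, k=4)

# Fraction of the original values that must be sent to a collaborator.
sent_fraction = (low.size + high_kept.size) / feat.size
```

With these example sizes, the sender transmits 16 x 8 x 8 low-frequency values plus 4 x 32 x 32 high-frequency values, which is 0.3125 of the original 16 x 32 x 32 map. The kept channels are recovered exactly, and the remaining channels retain only their pooled structure.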