DynamicVis: An Efficient and General Visual Foundation Model for Remote Sensing Image Understanding

Keyan Chen, Chenyang Liu, Bowen Chen, Wenyuan Li, Zhengxia Zou, Zhenwei Shi

TL;DR

DynamicVis targets generalizable, high-resolution remote sensing (RS) image understanding with two components: a dynamic region-aware backbone built on selective state space models, and a multi-instance learning (MIL) pretraining scheme based on meta-embeddings. The model selectively processes informative tokens, balancing local detail with global context so that it can handle the sparse, small targets common in RS imagery while remaining scalable. Across nine downstream tasks spanning region-, instance-, and pixel-level analysis, DynamicVis delivers strong cross-task performance with substantially lower compute and memory requirements than ViT-based foundation models. This combination of adaptive token routing and region-aware representation makes deployment on large-scale RS data more practical, enabling efficient, versatile interpretation of complex geospatial scenes.

Abstract

The advancement of remote sensing technology has improved the spatial resolution of satellite imagery, facilitating more detailed visual representations for diverse interpretations. However, existing methods exhibit limited generalization capabilities across varied applications. While some contemporary foundation models demonstrate potential, they are hindered by insufficient cross-task adaptability and primarily process low-resolution imagery of restricted sizes, thus failing to fully exploit high-resolution data or leverage comprehensive large-scene semantics. Crucially, remote sensing imagery differs fundamentally from natural images, as key foreground targets (e.g., maritime objects, artificial structures) often occupy minimal spatial proportions (~1%) and exhibit sparse distributions. Efficiently modeling cross-task generalizable knowledge from long 2D token sequences (~100,000 tokens) poses a significant challenge yet remains critical for remote sensing image understanding. Motivated by the selective attention mechanisms inherent to the human visual system, we propose DynamicVis, a dynamic visual perception foundation model for remote sensing imagery. The framework integrates a novel dynamic region perception backbone based on the selective state space model, which strategically balances localized detail extraction with global contextual integration, enabling computationally efficient encoding of large-scale data while maintaining architectural scalability. To enhance cross-task knowledge transfer, we introduce a multi-instance learning paradigm utilizing meta-embedding representations, trained on million-scale region-level annotations. Evaluations across nine downstream tasks demonstrate the model's versatility. DynamicVis achieves multi-level feature modeling with exceptional efficiency, processing $2048 \times 2048$ pixels with 97 ms latency (6% of ViT's) and 833 MB GPU memory (3% of ViT's).

Paper Structure

This paper contains 39 sections, 15 equations, 21 figures, 19 tables.

Figures (21)

  • Figure 1: (a) ViTs process all visual tokens uniformly. (b) DynamicVis selectively extracts key tokens at each block to perform adaptive modeling. (c) Memory consumption of different model architectures at varying input resolutions. (d) The proposed DynamicVis demonstrates versatility in interpreting diverse temporal and spatial localization patterns. Comprehensive evaluations across nine downstream tasks, spanning region-, instance-, and pixel-level understanding, demonstrate its efficacy, generalizability, and scalability.
  • Figure 2: Overview of the Dynamic Region-aware SSM Backbone, comprising four interconnected stages that generate hierarchical semantic feature maps at varying scales. Red boxes highlight regions of interest, while yellow boxes denote regions exhibiting structural simplicity or repetitive patterns.
  • Figure 3: The structure of the Sparse Mixer, including three key elements: a flattening operation, $N_i$ selective token incremental modeling (STIM) units, and an un-flattening operation.
  • Figure 4: The detailed architecture of the Selective Token Incremental Modeling (STIM) unit.
  • Figure 5: The structure of dual-path SSM scanning.
  • ...and 16 more figures
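
The captions for Figures 1(b) and 3 describe a flatten, select, model, un-flatten pattern. The sketch below illustrates only that general pattern and is not the authors' implementation. Several parts are assumptions made for illustration: the function name `sparse_mixer_sketch`, the `keep_ratio` parameter, the use of each token's L2 norm as its importance score, the `tanh` placeholder standing in for the sequence model (the paper uses a selective SSM inside each STIM unit), and the reading of "incremental" as a residual update on the selected tokens.

```python
import numpy as np

def sparse_mixer_sketch(feature_map, keep_ratio=0.25, update_fn=None):
    """Illustrative sketch: flatten -> select top-k tokens -> update them -> un-flatten."""
    h, w, c = feature_map.shape
    tokens = feature_map.reshape(h * w, c)  # flattening: (H, W, C) -> (L, C)
    k = max(1, int(round(keep_ratio * h * w)))

    # Hypothetical importance score (an assumption): L2 norm of each token.
    scores = np.linalg.norm(tokens, axis=1)
    # Indices of the k highest-scoring tokens, sorted to preserve scan order.
    selected = np.sort(np.argpartition(scores, -k)[-k:])

    if update_fn is None:
        # Placeholder for the sequence model applied to the selected tokens.
        update_fn = np.tanh

    out = tokens.copy()
    # Assumed residual ("incremental") update; unselected tokens pass through unchanged.
    out[selected] = tokens[selected] + update_fn(tokens[selected])
    return out.reshape(h, w, c), selected  # un-flattening: (L, C) -> (H, W, C)

rng = np.random.default_rng(0)
fm = rng.normal(size=(4, 4, 3))
out, selected = sparse_mixer_sketch(fm, keep_ratio=0.25)
print(out.shape, len(selected))
```

Only the selected tokens pass through `update_fn`, so the cost of that step scales with the number of kept tokens rather than with the full sequence length. This is the intuition behind processing key tokens selectively instead of uniformly.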