
OptiSAR-Net++: A Large-Scale Benchmark and Transformer-Free Framework for Cross-Domain Remote Sensing Visual Grounding

Xiaoyu Tang, Jun Dong, Jintao Cheng, Rui Fan

Abstract

Remote sensing visual grounding (RSVG) aims to localize specific targets in remote sensing images using natural language expressions. However, existing methods are restricted to single-sensor domains, i.e., either optical or synthetic aperture radar (SAR), limiting their real-world applicability. In this paper, we introduce the Cross-Domain RSVG (CD-RSVG) task and construct OptSAR-RSVG, the first large-scale benchmark dataset for this setting. To tackle the challenges of cross-domain feature modeling, computational inefficiency, and fine-grained semantic discrimination, we propose OptiSAR-Net++. Our framework features a patch-level Low-Rank Adaptation Mixture of Experts (PL-MoE) for efficient cross-domain feature decoupling. To mitigate the substantial computational overhead of Transformer decoding frameworks, we adopt a CLIP-based contrastive paradigm and further incorporate dynamic adversarial negative sampling, thereby transforming generative regression into an efficient cross-modal matching process. Additionally, a text-guided dual-gate fusion module (TGDF-SSA) and a region-aware auxiliary head are introduced to enhance semantic-visual alignment and spatial modeling. Extensive experiments demonstrate that OptiSAR-Net++ achieves state-of-the-art performance on both the OptSAR-RSVG and DIOR-RSVG benchmarks, offering significant advantages in localization accuracy and efficiency. Our code and dataset will be made publicly available.
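To make the patch-level expert routing idea concrete, below is a minimal PyTorch sketch of a LoRA-style mixture of experts applied per patch token. This is an illustration, not the paper's implementation: the module name PatchLoRAMoE, the top-k softmax router, and the rank/expert counts are assumptions introduced here; the excerpt does not specify the exact PL-MoE design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchLoRAMoE(nn.Module):
    """Toy patch-level LoRA mixture of experts: each patch token is routed
    to its top-k low-rank adapters on top of a frozen shared projection."""

    def __init__(self, dim: int, num_experts: int = 4, rank: int = 8, top_k: int = 1):
        super().__init__()
        self.base = nn.Linear(dim, dim)               # shared projection, kept frozen
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.router = nn.Linear(dim, num_experts)     # per-patch gating scores
        # Expert e applies the low-rank update x @ A[e] @ B[e] (LoRA-style).
        self.A = nn.Parameter(torch.randn(num_experts, dim, rank) * 0.02)
        self.B = nn.Parameter(torch.zeros(num_experts, rank, dim))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) patch tokens from the shared backbone
        gate = F.softmax(self.router(x), dim=-1)      # (B, N, E) routing weights
        topw, topi = gate.topk(self.top_k, dim=-1)    # keep top-k experts per patch
        out = self.base(x)                            # frozen shared path
        for k in range(self.top_k):
            idx = topi[..., k]                        # (B, N) chosen expert ids
            w = topw[..., k].unsqueeze(-1)            # (B, N, 1) gate weights
            A = self.A[idx]                           # (B, N, dim, rank)
            Bm = self.B[idx]                          # (B, N, rank, dim)
            h = torch.einsum('bnd,bndr->bnr', x, A)   # per-patch down-projection
            out = out + w * torch.einsum('bnr,bnrd->bnd', h, Bm)
        return out
```

With top_k=1, each patch takes a single domain-specific expert path, which keeps the added cost close to that of one LoRA layer, e.g. PatchLoRAMoE(dim=256)(torch.randn(2, 196, 256)) returns a (2, 196, 256) tensor.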

Figures (11)

  • Figure 1: Comparison of existing methods and our proposed CD-RSVG paradigm. (a) Existing cross-domain methods rely on co-registered optical-SAR pairs for detection only, lacking language grounding. (b) Mainstream RSVG methods rely on heavy Transformers and single-source data, lacking cross-domain generalization. (c) Our OptiSAR-Net++ handles multi-source inputs using a shared MoE backbone. It fuses visual-linguistic features and achieves efficient grounding via a CLIP matching paradigm. During training, a region-aware auxiliary head and adversarial negative sampling further enhance spatial modeling and fine-grained semantic discrimination.
  • Figure 2: Overall architecture of OptiSAR-Net++. The framework processes multi-source images and language queries for CD-RSVG via three main components: (1) A shared CNN backbone with PL-MoE, which adaptively routes image patches to domain-specific experts for cross-domain feature modeling. (2) A vision-language fusion neck (TGDF-SSA) that injects semantic information into multi-scale visual features. Text colors denote target categories (orange/blue) and attributes (green). (3) Detection heads comprising a regression head for candidate generation, a CLIP-based contrastive head for efficient retrieval matching, and an auxiliary region-aware classification head for spatial distribution modeling. During training, adversarial negatives (red text) are dynamically sampled to enhance fine-grained cross-domain grounding. Best viewed in color. A toy sketch of this contrastive matching step is given below, after the figure list.
  • Figure 3: Statistical analysis of the OptSAR-RSVG dataset. (a)-(c) Normalized bounding box height, width, and area distributions for optical (blue) and SAR (green) samples. Optical targets show broader scale variations, whereas SAR targets are concentrated at medium-to-small scales. (d) Sample counts per domain, reflecting practical data acquisition ratios. (e) Caption word count distribution (average: 11.13 words). (f) Average pixel area per category, highlighting inter-category scale diversity. (g) Sample distribution across 16 categories. Best viewed in color.
  • Figure 4: Word cloud visualizations of textual descriptions in OptSAR-RSVG. (a) Target attributes, highlighting size and color descriptors. (b) Overall vocabulary, dominated by terms like "image", "ship", and modality identifiers. (c) Category names. (d) Spatial directional vocabulary, providing crucial semantic cues for localization.
  • Figure 5: Representative samples from the OptSAR-RSVG dataset, covering diverse optical (top row) and SAR (bottom row) scenarios. Each image is paired with a bounding box and a textual description containing target attributes, categories, and spatial cues. Optical samples exhibit rich semantic details, while SAR samples demonstrate target localization under challenging low-contrast conditions.
  • ...and 6 more figures
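The CLIP-based contrastive head described in the Figure 2 caption replaces generative box regression with retrieval-style matching between candidate regions and the query text, trained against dynamically sampled hard negatives. The following is a minimal sketch under assumed shapes; the function name contrastive_matching_loss, the temperature tau, and the symmetric InfoNCE form are illustrative choices, as the excerpt does not give the actual sampling strategy or loss.

```python
import torch
import torch.nn.functional as F

def contrastive_matching_loss(region_emb, text_emb, neg_text_emb, tau: float = 0.07):
    """region_emb: (R, D) candidate-region embeddings, row 0 = ground-truth region.
    text_emb: (D,) embedding of the referring expression.
    neg_text_emb: (M, D) embeddings of sampled adversarial negative expressions."""
    region_emb = F.normalize(region_emb, dim=-1)
    texts = F.normalize(torch.cat([text_emb.unsqueeze(0), neg_text_emb]), dim=-1)
    logits = region_emb @ texts.t() / tau              # (R, 1+M) cosine similarities
    target = torch.zeros(1, dtype=torch.long)
    # Region axis: the ground-truth region should match the true text best.
    loss_region = F.cross_entropy(logits[:, 0].unsqueeze(0), target)
    # Text axis: the true text should beat the hard negatives on the GT region.
    loss_text = F.cross_entropy(logits[0].unsqueeze(0), target)
    return 0.5 * (loss_region + loss_text)
```

At inference, grounding then reduces to an argmax over region-text similarities (the candidate most similar to the query embedding), which is consistent with the abstract's claim of avoiding a heavy Transformer decoding pass per query.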