Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining

Yuxuan Li, Yuming Chen, Yunheng Li, Ming-Ming Cheng, Xiang Li, Jian Yang

TL;DR

BabelRS is a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning and consistently outperforms state-of-the-art methods without bells and whistles.

Abstract

Heterogeneous multi-modal remote sensing object detection aims to accurately detect objects from diverse sensors (e.g., RGB, SAR, Infrared). Existing approaches largely adopt a late alignment paradigm, in which modality alignment and task-specific optimization are entangled during downstream fine-tuning. This tight coupling complicates optimization and often results in unstable training and suboptimal generalization. To address these limitations, we propose BabelRS, a unified language-pivoted pretraining framework that explicitly decouples modality alignment from downstream task learning. BabelRS comprises two key components: Concept-Shared Instruction Aligning (CSIA) and Layerwise Visual-Semantic Annealing (LVSA). CSIA aligns each sensor modality to a shared set of linguistic concepts, using language as a semantic pivot to bridge heterogeneous visual representations. To further mitigate the granularity mismatch between high-level language representations and dense detection objectives, LVSA progressively aggregates multi-scale visual features to provide fine-grained semantic guidance. Extensive experiments demonstrate that BabelRS stabilizes training and consistently outperforms state-of-the-art methods without bells and whistles. Code: https://github.com/zcablii/SM3Det.

Paper Structure (34 sections, 21 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Conceptual comparison between (a) late alignment and (b) early, language-pivoted alignment paradigms for heterogeneous multi-modal remote sensing detection. Late alignment (a) entangles modality alignment with task optimization during fine-tuning, leading to gradient conflicts and unstable training. BabelRS (b) decouples these objectives via early semantic alignment, resulting in improved optimization stability and generalization.
  • Figure 2: Automatic Mixed Precision fine-tuning stability on the SOI-Det dataset. Many existing models experience gradient explosion before fine-tuning completes, whereas BabelRS remains stable throughout.
  • Figure 3: Overview of the BabelRS framework. BabelRS consists of two key components: Concept-Shared Instruction Aligning, which aligns heterogeneous remote sensing modalities into a shared linguistic semantic space using instruction-following objectives, and Layerwise Visual-Semantic Annealing, which progressively integrates multi-scale visual features into the language-aligned representation to support dense object detection.
  • Figure 4: Training loss curves under identical fine-tuning protocols. Late-alignment methods exhibit slow convergence, while BabelRS starts from a lower initial loss and converges smoothly.
  • Figure 5: Comparison of feature merge strategies: (a) feature concatenation, (b) element-wise summation, (c) per-layer projectors with LVSA, and (d) the proposed LVSA-based merge with a shared projector.
  • ...and 2 more figures
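
The four merge strategies named in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the bias-free linear projector, and the choice to sum projected layers in (c) and (d) are all assumptions made for illustration, and the annealing schedule of LVSA is not modeled because the text here does not specify it.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def concat_merge(layers: List[Vector]) -> Vector:
    """(a) Concatenate per-layer features; output width grows with the layer count."""
    return [x for feat in layers for x in feat]


def sum_merge(layers: List[Vector]) -> Vector:
    """(b) Element-wise summation; every layer must have the same width."""
    return [sum(vals) for vals in zip(*layers)]


def project(weight: Matrix, feat: Vector) -> Vector:
    """Hypothetical linear projector: a plain matrix-vector product, no bias."""
    return [sum(w * x for w, x in zip(row, feat)) for row in weight]


def per_layer_projector_merge(weights: List[Matrix], layers: List[Vector]) -> Vector:
    """(c) One projector per layer; summing the projected outputs is an assumption."""
    return sum_merge([project(w, f) for w, f in zip(weights, layers)])


def shared_projector_merge(weight: Matrix, layers: List[Vector]) -> Vector:
    """(d) A single projector reused for every layer; summing is again an assumption."""
    return sum_merge([project(weight, f) for f in layers])
```

Under these assumptions, the practical difference between (c) and (d) is parameter count: the per-layer variant stores one weight matrix per visual layer, while the shared variant stores one matrix regardless of how many layers are merged.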