MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification

Yujian Zhao, Hankun Liu, Guanglin Niu

TL;DR

This work presents MOS, a framework for mitigating the optical-SAR modality gap in cross-modal ship re-identification. It combines two components: Modality-Consistent Representation Learning (MCRL), which denoises SAR images and aligns same-identity feature distributions across modalities with a class-wise modality alignment loss, and Cross-modal Data Generation and Feature fusion (CDGF), which uses a Brownian bridge diffusion model to synthesize cross-modal samples whose features are fused with the original features at inference time. On the HOSS ReID benchmark, MOS outperforms prior state-of-the-art methods under every evaluation protocol, improving R1 accuracy by +3.0%, +6.2%, and +16.4% in the ALL to ALL, Optical to SAR, and SAR to Optical settings, respectively. These results support both the training-time alignment and the inference-time cross-modal synthesis, and suggest the approach is practical for maritime surveillance with heterogeneous sensors.

Abstract

Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery has recently emerged as a critical yet underexplored task in maritime intelligence and surveillance. However, the substantial modality gap between optical and SAR images poses a major challenge for robust identification. To address this issue, we propose MOS, a novel framework designed to mitigate the optical-SAR modality gap and achieve modality-consistent feature learning for optical-SAR cross-modal ship ReID. MOS consists of two core components: (1) Modality-Consistent Representation Learning (MCRL), which applies SAR image denoising and a class-wise modality alignment loss to align intra-identity feature distributions across modalities; and (2) Cross-modal Data Generation and Feature fusion (CDGF), which leverages a Brownian bridge diffusion model to synthesize cross-modal samples that are subsequently fused with the original features during inference to enhance alignment and discriminability. Extensive experiments on the HOSS ReID dataset demonstrate that MOS significantly surpasses state-of-the-art methods across all evaluation protocols, achieving improvements of +3.0%, +6.2%, and +16.4% in R1 accuracy under the ALL to ALL, Optical to SAR, and SAR to Optical settings, respectively. The code and trained models will be released upon publication.
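
To make the idea of a class-wise modality alignment loss more concrete, the following is a minimal sketch under stated assumptions. The abstract does not give the loss formula, so this example substitutes a simple centroid-matching penalty: for each ship identity present in both modalities, it measures the squared distance between the mean optical feature and the mean SAR feature. The function name `class_wise_alignment_loss`, the modality labels `"optical"` and `"sar"`, and the centroid formulation are all illustrative and are not taken from the paper.

```python
from collections import defaultdict


def class_wise_alignment_loss(features, ids, modalities):
    """Mean squared distance between per-identity optical and SAR centroids.

    Hypothetical stand-in for a class-wise modality alignment loss;
    the paper's actual formulation may differ.
    """
    # Group feature vectors by (identity, modality).
    groups = defaultdict(list)
    for feat, identity, modality in zip(features, ids, modalities):
        groups[(identity, modality)].append(feat)

    def centroid(vectors):
        n = len(vectors)
        return [sum(col) / n for col in zip(*vectors)]

    total, count = 0.0, 0
    for identity in sorted({key[0] for key in groups}):
        optical = groups.get((identity, "optical"))
        sar = groups.get((identity, "sar"))
        if not optical or not sar:
            continue  # identity seen in only one modality: nothing to align
        c_opt, c_sar = centroid(optical), centroid(sar)
        total += sum((a - b) ** 2 for a, b in zip(c_opt, c_sar))
        count += 1
    # Average over identities that appear in both modalities.
    return total / count if count else 0.0


if __name__ == "__main__":
    feats = [[0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    ids = [0, 0, 0, 1]
    mods = ["optical", "optical", "sar", "optical"]
    print(class_wise_alignment_loss(feats, ids, mods))
```

In a training loop, a term like this would typically be added to the identity loss with a weighting coefficient, so that features of the same ship are pulled together regardless of the sensor that produced them.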

Paper Structure

This paper contains 18 sections, 20 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Noise in a SAR image (left). Modality gap and ID gap in feature space (right).
  • Figure 2: Overview of our proposed MOS. MOS consists of two parts: Modality-Consistent Representation Learning (MCRL) and Cross-modal Data Generation and Feature fusion (CDGF). MCRL performs effective noise suppression on SAR images and introduces a class-wise modality alignment loss to align the distributions of multi-modal samples sharing the same identity during training. CDGF, on the other hand, generates corresponding SAR samples from optical inputs during inference and fuses them with original features to further enhance retrieval performance.
  • Figure 3: R1 heatmap over $\alpha$ and $\lambda_\text{cmal}$ in the SAR to Optical protocol.
  • Figure 4: Relationship between $\tau$ and R1 under different evaluation protocols.
  • Figure 5: Top-5 retrieval results in the ALL to ALL, Optical to SAR, and SAR to Optical protocols. Blue borders indicate correct retrievals, while red borders denote errors.
  • ...and 2 more figures
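
The inference-time fusion described in the Figure 2 caption (features of a generated SAR sample combined with the original optical features) can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the fusion rule here is a convex combination followed by L2 normalization, the weight is named `alpha` only by analogy with the $\alpha$ in Figure 3 (whose exact role is not defined in this section), and the feature vectors are placeholders for the outputs of a ReID encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_features(original, generated, alpha=0.5):
    """Blend original and generated-sample features, then L2-normalize.

    alpha is a hypothetical fusion weight: 0 keeps only the original
    feature, 1 keeps only the feature of the synthesized sample.
    """
    mixed = [(1 - alpha) * o + alpha * g for o, g in zip(original, generated)]
    return l2_normalize(mixed)


def cosine_similarity(a, b):
    """Cosine similarity, the usual ranking score in ReID retrieval."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    optical_feat = [1.0, 0.0]        # placeholder feature of an optical query
    generated_sar_feat = [0.0, 1.0]  # placeholder feature of its synthesized SAR view
    gallery_sar_feat = [1.0, 1.0]    # placeholder feature of a SAR gallery image
    query = fuse_features(optical_feat, generated_sar_feat, alpha=0.5)
    print(cosine_similarity(query, gallery_sar_feat))
```

In this toy case the fused query lies between the two modality features, so it scores higher against the SAR gallery vector than the optical feature alone would; whether that holds on real data depends on the quality of the generated samples.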