MOS: Mitigating Optical-SAR Modality Gap for Cross-Modal Ship Re-Identification
Yujian Zhao, Hankun Liu, Guanglin Niu
TL;DR
This work presents MOS, a framework for mitigating the optical-SAR modality gap in cross-modal ship re-identification. It combines two components: Modality-Consistent Representation Learning, which denoises SAR images and aligns per-identity feature distributions across modalities with a class-wise modality alignment loss (CMAL), and Cross-modal Data Generation and Feature Fusion, which uses a Brownian bridge diffusion model to synthesize cross-modal samples whose features are fused with the original features at inference time. On the HOSS ReID benchmark, MOS surpasses prior state-of-the-art methods in every evaluation setting, improving R1 accuracy by +3.0% (ALL to ALL), +6.2% (Optical to SAR), and +16.4% (SAR to Optical). These results support both the training-time alignment and the inference-time cross-modal synthesis, and point to a practical route toward more reliable ship tracking across heterogeneous sensors in maritime surveillance.
Abstract
Cross-modal ship re-identification (ReID) between optical and synthetic aperture radar (SAR) imagery has recently emerged as a critical yet underexplored task in maritime intelligence and surveillance. However, the substantial modality gap between optical and SAR images poses a major challenge for robust identification. To address this issue, we propose MOS, a novel framework designed to mitigate the optical-SAR modality gap and achieve modality-consistent feature learning for optical-SAR cross-modal ship ReID. MOS consists of two core components: (1) Modality-Consistent Representation Learning (MCRL) applies SAR image denoising and a class-wise modality alignment loss to align intra-identity feature distributions across modalities. (2) Cross-modal Data Generation and Feature Fusion (CDGF) leverages a Brownian bridge diffusion model to synthesize cross-modal samples, whose features are subsequently fused with the original features during inference to enhance alignment and discriminability. Extensive experiments on the HOSS ReID dataset demonstrate that MOS significantly surpasses state-of-the-art methods across all evaluation protocols, achieving improvements of +3.0%, +6.2%, and +16.4% in Rank-1 (R1) accuracy under the ALL to ALL, Optical to SAR, and SAR to Optical settings, respectively. The code and trained models will be released upon publication.
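The abstract does not give the exact form of the class-wise modality alignment loss or of the inference-time fusion rule, so the following is only a minimal sketch of the two ideas under stated assumptions. It assumes the alignment term is the squared Euclidean distance between the optical and SAR feature centers of each identity, and that fusion is a weighted average followed by L2 normalization. The function names, the `"optical"`/`"sar"` labels, and the `weight` parameter are hypothetical and are not taken from the MOS code.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def classwise_alignment_loss(features, identities, modalities):
    """Assumed form: per identity, squared distance between the optical
    and SAR feature centers, averaged over identities seen in both."""
    groups = defaultdict(lambda: defaultdict(list))
    for feat, pid, mod in zip(features, identities, modalities):
        groups[pid][mod].append(feat)

    terms = []
    for by_modality in groups.values():
        # Identities observed in only one modality contribute nothing.
        if "optical" in by_modality and "sar" in by_modality:
            c_opt = mean_vector(by_modality["optical"])
            c_sar = mean_vector(by_modality["sar"])
            terms.append(sum((a - b) ** 2 for a, b in zip(c_opt, c_sar)))
    return sum(terms) / len(terms) if terms else 0.0


def fuse_features(original, generated, weight=0.5):
    """Assumed fusion rule: weighted average of the original feature and
    the feature of its synthesized cross-modal sample, then L2-normalize."""
    mixed = [(1 - weight) * o + weight * g for o, g in zip(original, generated)]
    norm = math.sqrt(sum(v * v for v in mixed))
    return [v / norm for v in mixed] if norm > 0 else mixed
```

In this sketch the loss falls to zero when the two modality centers of every identity coincide, which is one simple way to express "aligning intra-identity feature distributions". The fused vector can be compared with gallery features by cosine similarity, since it has unit length.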
