Multi-Scale Correlation-Aware Transformer for Maritime Vessel Re-Identification
Yunhe Liu
TL;DR
This paper tackles maritime vessel re-identification by introducing MCFormer, a Transformer-based architecture that models inter-image correlations at both global and local scales to counteract large intra-identity variations and missing local parts. A Global Correlation Module (GCM) builds a global affinity across all input images using landmark-based proxies, and a Local Correlation Module (LCM) maintains a memory bank of local features to pull nearby positives together via a clustering loss. The global and local representations are fused with a Multi-Scale Channel Attention mechanism, yielding state-of-the-art performance on the VesselReID, Warships-ReID, and BoatRe-ID datasets. The results indicate that explicit cross-image correlation modeling, combined with multi-scale fusion, yields more discriminative and generalizable vessel representations under challenging maritime imaging conditions. The work has practical impact for maritime surveillance and situational awareness systems, where it improves re-identification accuracy across diverse platforms and conditions.
Abstract
Maritime vessel re-identification (Re-ID) plays a crucial role in advancing maritime monitoring and intelligent situational awareness systems. However, some existing vessel Re-ID methods are adapted directly from pedestrian-focused algorithms and are therefore ill-suited to the problems specific to vessel images, in particular larger intra-identity variations and more severe loss of local parts, both of which produce outlier samples within the same identity. To address these challenges, we propose the Multi-scale Correlation-aware Transformer Network (MCFormer), which explicitly models multi-scale correlations across the entire input set to suppress the adverse effects of outlier samples caused by intra-identity variations or missing local parts. MCFormer incorporates two novel modules: the Global Correlation Module (GCM) and the Local Correlation Module (LCM). Specifically, GCM constructs a global affinity matrix across all input images and models global correlations by aggregating features according to inter-image consistency, rather than learning features solely from individual images as most existing approaches do. Simultaneously, LCM maintains a dynamic memory bank to mine and align the local features of positive samples with contextual similarity, extracting local correlations that compensate for missing or occluded regions in individual images. To further enhance feature robustness, MCFormer integrates the correlated global and local features across multiple scales, capturing latent relationships among image features. Experiments on three benchmarks demonstrate that MCFormer achieves state-of-the-art performance.
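To make the idea of aggregating features through a global affinity matrix more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes one feature vector per image, cosine similarity as the affinity measure, and a row-wise softmax with a hypothetical `temperature` parameter; the function name `global_affinity_aggregate` and the synthetic data are illustrative only, and the landmark-based proxies mentioned in the TL;DR are not modeled.

```python
import numpy as np


def global_affinity_aggregate(features, temperature=0.1):
    """Aggregate per-image features through a batch-wide affinity matrix.

    Illustrative assumption: affinity is a row-wise softmax over cosine
    similarities, which may differ from how GCM actually builds it.
    """
    feats = np.asarray(features, dtype=np.float64)
    # L2-normalize so that dot products are cosine similarities.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    normed = feats / np.clip(norms, 1e-12, None)
    similarity = normed @ normed.T
    # Numerically stable row-wise softmax over the scaled similarities.
    logits = similarity / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    affinity = weights / weights.sum(axis=1, keepdims=True)
    # Each output row is a weighted mix of all images in the batch,
    # dominated by the images most similar to it.
    aggregated = affinity @ normed
    return affinity, aggregated


# Synthetic batch: three images each from two hypothetical identities.
rng = np.random.default_rng(0)
center_a = np.array([1.0, 0.0, 0.0, 0.0])
center_b = np.array([0.0, 1.0, 0.0, 0.0])
batch = np.stack(
    [center_a + 0.05 * rng.normal(size=4) for _ in range(3)]
    + [center_b + 0.05 * rng.normal(size=4) for _ in range(3)]
)
affinity, aggregated = global_affinity_aggregate(batch)
```

In this toy setting, images of the same synthetic identity receive most of each other's affinity mass, so an outlier view is pulled toward the other views of its identity. That is the intuition behind modeling correlations across the whole input set instead of encoding each image in isolation.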
