Multi-Scale Correlation-Aware Transformer for Maritime Vessel Re-Identification

Yunhe Liu

TL;DR

This paper tackles maritime vessel re-identification with MCFormer, a Transformer-based architecture that models inter-image correlations at both global and local scales to counteract large intra-identity variations and missing local parts. A Global Correlation Module (GCM) builds a global affinity across all input images using landmark-based proxies, and a Local Correlation Module (LCM) maintains a memory bank of local features and pulls neighboring positive samples together via a clustering loss. The correlated global and local representations are fused with a multi-scale channel attention mechanism, and the resulting model reports state-of-the-art performance on the VesselReID, Warships-ReID, and BoatRe-ID datasets. The results suggest that explicit cross-image correlation modeling, combined with multi-scale fusion, yields more discriminative and generalizable vessel representations under challenging maritime imaging conditions, which is relevant to maritime surveillance and situational awareness systems.

Abstract

Maritime vessel re-identification (Re-ID) plays a crucial role in advancing maritime monitoring and intelligent situational awareness systems. However, some existing vessel Re-ID methods are directly adapted from pedestrian-focused algorithms, making them ill-suited to the problems specific to vessel images, particularly the larger intra-identity variations and the more severe loss of local parts, which produce outlier samples within the same identity. To address these challenges, we propose the Multi-scale Correlation-aware Transformer Network (MCFormer), which explicitly models multi-scale correlations across the entire input set to suppress the adverse effects of outlier samples caused by intra-identity variations or missing local parts. MCFormer incorporates two novel modules: the Global Correlation Module (GCM) and the Local Correlation Module (LCM). Specifically, GCM constructs a global affinity matrix across all input images and models global correlations through feature aggregation based on inter-image consistency, rather than learning features solely from individual images as most existing approaches do. Simultaneously, LCM maintains a dynamic memory bank to mine and align the local features of positive samples with contextual similarity, extracting local correlations that compensate for missing or occluded regions in individual images. To further enhance feature robustness, MCFormer fuses the correlated global and local features across multiple scales, capturing latent relationships among image features. Experiments on three benchmarks demonstrate that MCFormer achieves state-of-the-art performance.
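
The global correlation idea described above (an affinity matrix over all input images, followed by aggregation) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the cosine affinity, the top-`k` sparsification, the softmax weighting, and the names `correlate_globally`, `k`, and `temperature` are all assumptions made for illustration, and the landmark-based mapping is omitted.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def correlate_globally(features, k=2, temperature=0.1):
    """Aggregate each feature with its k most similar features in the batch.

    Assumed recipe (hypothetical, for illustration only):
      1. cosine affinity between every pair of images,
      2. keep the k largest entries per row (self included),
      3. softmax over the kept entries,
      4. weighted sum of the corresponding features.
    """
    feats = [l2_normalize(f) for f in features]
    n = len(feats)
    dim = len(feats[0])
    correlated = []
    for i in range(n):
        # Row i of the affinity matrix: similarity of image i to all images.
        sims = [sum(a * b for a, b in zip(feats[i], feats[j])) for j in range(n)]
        # Sparsify the row by keeping only the k strongest neighbors.
        keep = sorted(range(n), key=lambda j: sims[j], reverse=True)[:k]
        # Numerically stable softmax over the kept affinities.
        top = max(sims[j] for j in keep)
        weights = {j: math.exp((sims[j] - top) / temperature) for j in keep}
        total = sum(weights.values())
        # Weighted aggregation of neighbor features.
        correlated.append(
            [sum(weights[j] / total * feats[j][d] for j in keep) for d in range(dim)]
        )
    return correlated
```

With `k=1` each image keeps only itself, so the output reduces to the normalized input; larger `k` lets consistent neighbors contribute, which is the intuition behind damping outlier samples.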

Paper Structure

This paper contains 17 sections, 12 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: (a) Comparison of intra-identity variations in person/vehicle and vessel images, where images framed in the same color share the same person/vehicle/vessel identity. (b) Illustration of vessel part annotations and missing local parts under different viewing conditions. Each vessel image exhibits partial occlusion or missing regions due to varying viewpoint and distance.
  • Figure 2: An illustration of our proposed MCFormer. The input images are first fed into the Transformer Encoder, which extracts global features using the class token $x_{\mathrm{cls}}$ and all image tokens, and local features using the part tokens $x_{p_i}$ and the corresponding regional image tokens. The global features are processed by the GCM, where they are first mapped from a high-dimensional space to a lower-dimensional space using a small set of randomly sampled landmarks. A global affinity matrix is then computed and sparsified, followed by a weighted aggregation to obtain the correlated global representations. The local features are processed by the LCM, where positive samples are identified and a clustering loss is applied to pull neighboring positive samples closer together. The correlated local representation is then obtained through concatenation and dimensionality reduction. Finally, the correlated global and local features are fused via a multi-scale channel attention module, resulting in a robust representation used for vessel Re-ID retrieval.
  • Figure 3: Visualization of the ranking results on the VesselReID test set. Green and red boxes indicate correct and incorrect retrieval results, respectively.
  • Figure 4: Attention map visualization of the proposed MCFormer under different component settings. (a) Input image of the vessel. (b) Adding GCM only. (c) Adding LCM only. (d) MCFormer.
  • Figure 5: Change in mAP and GFLOPs with different values of $l$. The brown line represents VesselReID, while the blue line represents Warships-ReID.
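
The LCM step in the Figure 2 caption (identify positive samples, then pull neighboring positives together with a clustering loss) can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's formulation: positives are taken here to be same-identity entries in the bank ranked by cosine similarity, the loss is taken to be the mean of one minus cosine similarity, and `LocalMemoryBank`, `nearest_positives`, `clustering_loss`, `momentum`, and `k` are hypothetical names.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(a, b):
    """Dot product of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


class LocalMemoryBank:
    """Keeps one local feature per training sample, updated with momentum."""

    def __init__(self, momentum=0.5):
        self.momentum = momentum
        self.features = {}  # sample_id -> unit-length local feature
        self.labels = {}    # sample_id -> identity label

    def update(self, sample_id, feature, label):
        new = l2_normalize(feature)
        old = self.features.get(sample_id)
        if old is not None:
            # Blend the stored feature with the new one, then re-normalize.
            m = self.momentum
            new = l2_normalize([m * o + (1 - m) * x for o, x in zip(old, new)])
        self.features[sample_id] = new
        self.labels[sample_id] = label

    def nearest_positives(self, sample_id, feature, label, k=2):
        """Return up to k same-identity bank features closest to the query."""
        query = l2_normalize(feature)
        scored = [
            (cosine(query, f), sid)
            for sid, f in self.features.items()
            if sid != sample_id and self.labels[sid] == label
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [self.features[sid] for _, sid in scored[:k]]


def clustering_loss(feature, positives):
    """Mean (1 - cosine similarity) between a feature and its mined positives."""
    if not positives:
        return 0.0
    query = l2_normalize(feature)
    return sum(1.0 - cosine(query, p) for p in positives) / len(positives)
```

In a training loop one would typically update the bank after each batch and add the clustering term to the identification loss; how the paper weights or schedules these terms is not stated in this section.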