AMANet: Advancing SAR Ship Detection with Adaptive Multi-Hierarchical Attention Network

Xiaolin Ma, Junkai Cheng, Aihua Li, Yuhua Zhang, Zhilong Lin

TL;DR

This work addresses ship detection in SAR imagery, focusing on small and coastal vessels, where clutter and limited features degrade performance. It introduces AMANet, a detector built around the plug-and-play adaptive multi-hierarchical attention module (AMAM), which comprises a Multi-hierarchical Enhanced (ME) block for multi-scale feature fusion and an Adaptive Attention (AA) block that applies attention per head over channel groups and combines the results with learnable aggregation. Together, the ME and AA blocks provide multi-scale feature aggregation and diverse attention maps. On the SSDD and HRSID datasets, AMANet improves detection accuracy and outperforms state-of-the-art methods in both inshore and offshore scenarios, with gains reported across multiple YOLO backbones. These results suggest that AMANet is practical for SAR ship detection in cluttered coastal environments; future work will extend AMAM to Transformer-based backbones.

Abstract

Deep learning methods have recently been applied successfully to ship detection in synthetic aperture radar (SAR) images. Although many ship detection methods have been developed, detecting small and coastal ships remains a significant challenge because of their limited features and the clutter in coastal environments. To address this, we propose a novel adaptive multi-hierarchical attention module (AMAM) that learns multi-scale features and adaptively aggregates salient features from various feature layers, even in complex environments. Specifically, we first fuse information from adjacent feature layers to enhance the detection of smaller targets, thereby achieving multi-scale feature enhancement. Then, to filter out the adverse effects of complex backgrounds, we split the fused multi-level features along the channel dimension, extract the salient regions of each part individually, and adaptively aggregate the features originating from different channels. Next, we present a novel adaptive multi-hierarchical attention network (AMANet) by embedding the AMAM between the backbone network and the feature pyramid network (FPN). In addition, the AMAM can be readily inserted into different frameworks to improve object detection. Finally, extensive experiments on two large-scale SAR ship detection datasets demonstrate that AMANet is superior to state-of-the-art methods.

Paper Structure (27 sections, 11 equations, 7 figures, 5 tables)

This paper contains 27 sections, 11 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: The difference between visible and SAR images. The first row shows visible images, and the second row shows SAR images. The green rectangles enclose the ground truth.
  • Figure 2: The structure of the AMAM. It consists of two main components: the multi-hierarchical enhanced (ME) block and the adaptive attention (AA) block. The ME block leverages contextual features from adjacent and deeper layers, which aids accurate ship detection. The AA block splits the fused feature across the attention heads, increasing the diversity of the attention maps and improving discrimination against inshore clutter. Note that CBR denotes Convolution, Batch Normalization, and ReLU. $F_{i}$ is the feature map of the current layer. $c$, $h$, and $w$ are the channel count, height, and width of the Fused Feature, respectively, and $c_1 = c_i = c_n$. $\alpha$ and $\beta$ are learnable coefficients.
  • Figure 3: The network structure of the proposed AMANet. The figure shows the integration of AMAM into the YOLO model (based on YOLOv8s), which requires additional backbone network features. CBS denotes convolution, batch normalization, and SiLU activation. SPPF denotes the Spatial Pyramid Pooling - Fast module. The C2F module is a lightweight module inspired by C3 that incorporates ideas from ELAN.
  • Figure 4: Impact of the number of heads in the AMAM on YOLOv8s.
  • Figure 5: Impact of the fusion functions in the adaptive attention stage.
  • ...and 2 more figures
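
The data flow described in the Figure 2 caption (fuse adjacent layers, split the fused feature across heads, then recombine with learnable coefficients) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `me_fuse` and `aa_block`, the nearest-neighbour upsampling, the sigmoid-of-channel-mean saliency map, and the way `alpha` and `beta` weight the attended and fused features are all assumptions made for illustration. The CBR layers are omitted, and the inputs are assumed to already share the same channel count, as the caption's $c_1 = c_i = c_n$ suggests.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def me_fuse(f_cur, f_deep):
    """Hypothetical ME-style fusion of two (c, h, w) feature maps.

    Assumption: the deeper map is coarser by an integer factor and is
    brought to the current resolution by nearest-neighbour upsampling,
    then added element-wise.
    """
    factor = f_cur.shape[1] // f_deep.shape[1]
    up = f_deep.repeat(factor, axis=1).repeat(factor, axis=2)
    return f_cur + up


def aa_block(fused, num_heads, alpha, beta):
    """Hypothetical AA-style block on a (c, h, w) fused feature.

    Assumptions: channels are split evenly across heads, each head gets
    its own spatial saliency map (sigmoid of its channel mean), and the
    output is alpha * attended + beta * fused. In a trained model alpha
    and beta would be learnable; here they are plain floats.
    """
    c, h, w = fused.shape
    heads = fused.reshape(num_heads, c // num_heads, h, w)
    saliency = sigmoid(heads.mean(axis=1, keepdims=True))  # (heads, 1, h, w)
    attended = (heads * saliency).reshape(c, h, w)
    return alpha * attended + beta * fused


rng = np.random.default_rng(0)
f_cur = rng.standard_normal((8, 16, 16))   # current-layer feature F_i
f_deep = rng.standard_normal((8, 8, 8))    # deeper, coarser feature
fused = me_fuse(f_cur, f_deep)
out = aa_block(fused, num_heads=4, alpha=0.5, beta=0.5)
print(out.shape)  # (8, 16, 16)
```

Because each head computes its own saliency map, the heads can emphasize different spatial regions, which is one plausible reading of the "diversity of attention maps" mentioned in the caption.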