WS-DETR: Robust Water Surface Object Detection through Vision-Radar Fusion with Detection Transformer

Huilin Yin, Pengyu Wang, Senmao Li, Jun Yan, Daniel Watzenig

TL;DR

WS-DETR tackles robust water-surface object detection for USVs by fusing camera images with 4D-mmWave radar within a DETR-based framework. It introduces Adaptive Feature Interactive Fusion for cross-modal alignment, Multi-Scale Edge Information Integration for boundary preservation, a Hierarchical Feature Aggregator for cross-scale fusion in the encoder, and an SMPBlock radar backbone built on self-moving points to handle irregular radar data. On the WaterScenes dataset, WS-DETR achieves state-of-the-art performance and remains robust under adverse weather and lighting, while keeping a compact parameter count and reasonable compute. The approach supports resilient USV navigation by delivering reliable multi-object detection across challenging aquatic environments.

Abstract

Robust object detection for Unmanned Surface Vehicles (USVs) in complex water environments is essential for reliable navigation and operation. Water surface object detection in particular is challenged by blurred edges and diverse object scales. Although vision-radar fusion offers a feasible solution, existing approaches suffer from cross-modal feature conflicts, which degrade model robustness. To address this problem, we propose WS-DETR, a robust vision-radar fusion model. We first introduce a Multi-Scale Edge Information Integration (MSEII) module to enhance edge perception and a Hierarchical Feature Aggregator (HiFA) to boost multi-scale object detection in the encoder. We then adopt self-moving point representations with continuous convolution and residual connections to efficiently extract features from irregular point cloud data. To further mitigate cross-modal conflicts, an Adaptive Feature Interactive Fusion (AFIF) module integrates visual and radar features through geometric alignment and semantic fusion. Extensive experiments on the WaterScenes dataset demonstrate that WS-DETR achieves state-of-the-art (SOTA) performance and maintains its superiority even under adverse weather and lighting conditions.
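
To make the idea of adaptively combining visual and radar features more concrete, the following is a minimal sketch of a gated blend of two feature vectors. It is not the AFIF module described in the abstract: the function name `gated_fusion`, the parameters `gate_weight` and `gate_bias`, and the choice of a sigmoid gate driven by the per-channel difference are all assumptions made for illustration only.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(image_feat, radar_feat, gate_weight=1.0, gate_bias=0.0):
    """Blend two equal-length feature vectors with a per-channel gate.

    Hypothetical illustration only; the gate here depends on the
    difference between the two modalities, which is an assumption
    and not the paper's formulation.
    """
    if len(image_feat) != len(radar_feat):
        raise ValueError("feature vectors must have the same length")
    fused = []
    for img, rad in zip(image_feat, radar_feat):
        # A gate near 1 favours the image channel, near 0 the radar channel.
        gate = sigmoid(gate_weight * (img - rad) + gate_bias)
        fused.append(gate * img + (1.0 - gate) * rad)
    return fused


if __name__ == "__main__":
    print(gated_fusion([2.0, 0.5, 0.0], [0.0, 0.5, 1.0]))
```

Because each output channel is a convex combination of the two inputs, the fused value always lies between the image and radar values for that channel. In a trained model the gate would be learned rather than fixed.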

Paper Structure

This paper contains 18 sections, 11 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of our USV object detection method in complex water surface environments. Our method extracts and fuses features from the camera and 4D-mmWave radar, combines low-dimensional edge features, and fuses them with multi-scale, high-dimensional features to achieve robust water surface object detection under adverse conditions.
  • Figure 2: The network pipeline of WS-DETR. The model extracts multi-level features from both modalities, then performs fusion using the Adaptive Feature Interactive Fusion (AFIF) module, enriches high-dimensional features with low-level edge information via the Multi-Scale Edge Information Integration (MSEII) module, and finally applies the Hierarchical Feature Aggregator (HiFA) module within the encoder for multi-scale feature fusion before detection.
  • Figure 3: The structure of the Adaptive Feature Interactive Fusion (AFIF) module. The AFIF architecture comprises two cascaded stages: Feature Synchronization Fusion (FSF) for adaptive information supplementation of radar features and Feature Selection Enhancement (FSE) for cross-modal feature alignment enhancement of image features, which ultimately generates an enhanced image feature.
  • Figure 4: The structure of the Hierarchical Feature Aggregator (HiFA) module and its integration in the encoder. HiFA fuses multi-scale features via scale-specific transformations and a dynamic weighting mechanism. The encoder employs HiFA modules to enhance cross-scale information flow, followed by upsampling and convolution to diffuse features across levels, enabling robust object detection under scale variations.
  • Figure 5: Comparison of experimental results. We select YOLOv11 and RT-DETR, which achieve strong performance on the dataset, to compare with our proposed model WS-DETR under different environmental conditions. The first row shows results in rainy scenes, the second in low-light environments, the third in scenes with multi-scale objects, and the last in scenes with edge-blurred objects.
  • ...and 1 more figure
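
The Figure 4 caption describes fusing multi-scale features through scale-specific transformations and a dynamic weighting mechanism. The sketch below illustrates that general pattern under stated assumptions and is not the HiFA module itself: nearest-neighbour resampling stands in for the scale-specific transformations, and a softmax over each scale's mean activation stands in for the learned dynamic weights. The names `resize_nearest` and `fuse_scales` are hypothetical.

```python
import math


def resize_nearest(feat, target_len):
    """Nearest-neighbour resample of a 1D feature list to target_len."""
    n = len(feat)
    return [feat[min(n - 1, (i * n) // target_len)] for i in range(target_len)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scales(features, target_len):
    """Fuse 1D features of different lengths into one vector.

    Assumption: the weight of each scale comes from its mean
    activation, a stand-in for a learned weighting branch.
    """
    # Bring every scale to a common resolution.
    resized = [resize_nearest(f, target_len) for f in features]
    # One scalar score per scale, normalised into weights that sum to 1.
    scores = [sum(r) / target_len for r in resized]
    weights = softmax(scores)
    # Weighted sum across scales at each position.
    fused = [
        sum(w * r[i] for w, r in zip(weights, resized))
        for i in range(target_len)
    ]
    return fused, weights


if __name__ == "__main__":
    fused, weights = fuse_scales([[1.0, 2.0, 3.0, 4.0], [1.0, 3.0]], 4)
    print(fused, weights)
```

Normalising the per-scale scores with a softmax keeps the fused output on the same scale as the inputs no matter how many feature levels are combined.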