WS-DETR: Robust Water Surface Object Detection through Vision-Radar Fusion with Detection Transformer
Huilin Yin, Pengyu Wang, Senmao Li, Jun Yan, Daniel Watzenig
TL;DR
WS-DETR tackles robust water-surface object detection for unmanned surface vehicles (USVs) by fusing camera images with 4D-mmWave radar in a DETR-based framework. It introduces Adaptive Feature Interactive Fusion (AFIF) for cross-modal alignment, Multi-Scale Edge Information Integration (MSEII) for boundary preservation, a Hierarchical Feature Aggregator (HiFA) for cross-scale fusion in the encoder, and SMPBlock, a radar backbone built on self-moving points that handles irregular radar data. On the WaterScenes dataset, WS-DETR achieves state-of-the-art performance and remains robust under adverse weather and lighting, while keeping a compact parameter count and reasonable compute. The approach supports resilient USV navigation by delivering reliable multi-object detection in challenging aquatic environments.
Abstract
Robust object detection for Unmanned Surface Vehicles (USVs) in complex water environments is essential for reliable navigation and operation. Water surface object detection is challenging because object edges are blurred and object scales vary widely. Although vision-radar fusion offers a feasible solution, existing approaches suffer from cross-modal feature conflicts, which degrade model robustness. To address this problem, we propose WS-DETR, a robust vision-radar fusion model. First, we introduce a Multi-Scale Edge Information Integration (MSEII) module to enhance edge perception and a Hierarchical Feature Aggregator (HiFA) to improve multi-scale object detection in the encoder. Second, we adopt self-moving point representations for continuous convolution, together with residual connections, to efficiently extract features from irregular point cloud data. Third, to further mitigate cross-modal conflicts, we introduce an Adaptive Feature Interactive Fusion (AFIF) module that integrates visual and radar features through geometric alignment and semantic fusion. Extensive experiments on the WaterScenes dataset demonstrate that WS-DETR achieves state-of-the-art (SOTA) performance and maintains its advantage even under adverse weather and lighting conditions.
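To make the idea of "geometric alignment and semantic fusion" concrete, the following is a minimal illustrative sketch and not the AFIF module itself, whose internals the abstract does not specify. It assumes a pinhole camera model for alignment and a simple sigmoid-gated blend for fusion. The function names (radar_to_grid, gated_fuse), the scalar per-point radar feature, and all parameters are hypothetical choices made for illustration.

```python
import math

def radar_to_grid(points, fx, fy, cx, cy, width, height):
    """Geometric alignment (assumed pinhole model): project radar points,
    given as (x, y, z, feat) in the camera frame, onto an image-sized grid."""
    grid = [[0.0] * width for _ in range(height)]
    for x, y, z, feat in points:
        if z <= 0:  # point is behind the camera and cannot be projected
            continue
        u = math.floor(fx * x / z + cx)  # pixel column
        v = math.floor(fy * y / z + cy)  # pixel row
        if 0 <= u < width and 0 <= v < height:
            # keep the strongest radar response that lands in this cell
            grid[v][u] = max(grid[v][u], feat)
    return grid

def gated_fuse(visual, radar, w_visual=1.0, w_radar=1.0, bias=0.0):
    """Fusion stand-in (assumed): a per-cell sigmoid gate decides how much
    of the visual versus the radar value is kept."""
    fused = []
    for v_row, r_row in zip(visual, radar):
        row = []
        for v, r in zip(v_row, r_row):
            gate = 1.0 / (1.0 + math.exp(-(w_visual * v + w_radar * r + bias)))
            row.append(gate * v + (1.0 - gate) * r)
        fused.append(row)
    return fused

# Toy usage: one valid radar point and one behind the camera.
points = [(0.0, 0.0, 10.0, 2.0), (1.0, 0.0, -5.0, 9.0)]
radar_grid = radar_to_grid(points, fx=100, fy=100, cx=2, cy=1, width=4, height=3)
visual_grid = [[0.0] * 4 for _ in range(3)]
fused_grid = gated_fuse(visual_grid, radar_grid)
```

In a learned model the gate weights would be trained and the features would be multi-channel tensors rather than scalars. The sketch only shows the order of operations: align the radar data to the image plane first, then blend the two modalities cell by cell.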
