PhysFusion: A Transformer-based Dual-Stream Radar and Vision Fusion Framework for Open Water Surface Object Detection

Yuting Wan, Liguo Sun, Jiuwu Hao, Zao Zhang, Pin LV

TL;DR

PhysFusion is a physics-informed radar-image detection framework for water-surface perception. It integrates a Physics-Informed Radar Encoder, SASA-based global reasoning, and a Radar-guided Interactive Fusion Module (RIFM), and it aggregates frame-wise fused queries over a short temporal window to obtain temporally consistent representations.

Abstract

Detecting water-surface targets for Unmanned Surface Vehicles (USVs) is challenging due to wave clutter, specular reflections, and weak appearance cues in long-range observations. Although 4D millimeter-wave radar complements cameras under degraded illumination, maritime radar point clouds are sparse and intermittent, and their reflectivity attributes exhibit heavy-tailed variations under scattering and multipath, so conventional fusion designs struggle to exploit radar cues effectively. We propose PhysFusion, a physics-informed radar-image detection framework for water-surface perception. The framework integrates: (1) a Physics-Informed Radar Encoder (PIR Encoder) with an RCS Mapper and Quality Gate, which transforms per-point radar attributes into compact scattering priors and predicts point-wise reliability for robust feature learning under clutter; (2) a Radar-guided Interactive Fusion Module (RIFM), which performs query-level radar-image fusion between semantically enriched radar features and multi-scale visual features, with the radar branch modeled by a dual-stream backbone comprising a point-based local stream and a transformer-based global stream using Scattering-Aware Self-Attention (SASA); and (3) a Temporal Query Aggregation module (TQA), which aggregates frame-wise fused queries over a short temporal window for temporally consistent representations. Experiments on WaterScenes and FLOW demonstrate that PhysFusion achieves 59.7% mAP50:95 and 90.3% mAP50 on WaterScenes (T=5 radar history) using 5.6M parameters and 12.5G FLOPs, and reaches 94.8% mAP50 and 46.2% mAP50:95 on FLOW under the radar+camera setting. Ablation studies quantify the contributions of the PIR Encoder, SASA-based global reasoning, and RIFM.
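
As a rough illustration of item (1), the sketch below maps one radar point to a scattering prior and a reliability score, then scales the point features by that score. It is a minimal sketch under stated assumptions, not the authors' implementation: the signed-log compression of RCS, the single linear gate, and the names `signed_log`, `pir_encode`, `w_gate`, and `b_gate` are all hypothetical stand-ins for the learned RCS Mapper and Quality Gate.

```python
import math


def signed_log(x):
    """Compress a heavy-tailed value as sign(x) * log(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def pir_encode(point, w_gate, b_gate):
    """Return (s, g, gated_features) for one point (x, y, z, v, rcs).

    ASSUMPTION: the scattering prior s is a fixed signed-log of RCS and the
    reliability g is a sigmoid over one linear layer. In the paper both are
    learned modules whose exact form is not given in this section.
    """
    x, y, z, v, rcs = point
    s = signed_log(rcs)                      # hypothetical scattering prior
    feat = [x, y, z, v, s]                   # prior augments the raw attributes
    logit = sum(w * f for w, f in zip(w_gate, feat)) + b_gate
    g = sigmoid(logit)                       # point-wise reliability in (0, 1)
    return s, g, [g * f for f in feat]       # gate down-weights the features


# Usage with placeholder (untrained) gate parameters.
s, g, gated = pir_encode((1.0, 2.0, 0.0, 0.5, 100.0), [0.0] * 5, 0.0)
```

With trained gate parameters, points that resemble clutter would receive a small `g`, so their features contribute less to later encoding and fusion.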

Paper Structure (47 sections, 22 equations, 10 figures, 4 tables, 1 algorithm)

Figures (10)

  • Figure 1: Overview of PhysFusion. The framework consists of three contribution-aligned modules: (i) PIR Encoder, where an RCS Mapper produces a compact scattering prior $s$ and a Quality Gate predicts a point-wise reliability score $g$ to modulate radar features under heavy-tailed reflectivity and intermittent returns; (ii) RIFM, which adopts a dual-stream radar backbone with a point-based local stream and a Transformer-based global stream equipped with SASA (Scattering-Aware Self-Attention), together with cross-stream interaction to exchange local details and global context, and performs query-level radar-image fusion; and (iii) TQA-GRU, which aggregates per-frame fused cross-modal queries over a temporal window using shared weights to obtain temporally consistent query representations. The aggregated queries are finally fed into a detection head to output 2D water-surface object detections.
  • Figure 2: Illustration of the PIR Encoder with the RCS Mapper and Quality Gate. Left: raw radar returns on the water-surface plane include target points and clutter/outliers. Middle: each point $p_i=(x_i,y_i,z_i,v_i,\mathrm{RCS}_i)$ is mapped to a compact scattering prior $s_i$ and a point-wise reliability score $g_i$. Right: the scattering prior augments radar features, while the confidence gate down-weights unreliable returns (e.g., heavy-tailed or intermittent scattering), yielding a reliability-aware radar representation for subsequent encoding and fusion.
  • Figure 3: Dual-stream radar backbone for radar point encoding. (a) Point-based local stream performs local graph aggregation and injects global context via a pooled summary and a lightweight context MLP, producing locality-preserving features with contextual cues. (b) Global Transformer block for radar modeling, where SASA incorporates a distance-decay prior from point coordinates to suppress spurious long-range interactions and improve global reasoning over sparse returns.
  • Figure 4: Temporal Query Aggregation Module (TQA-GRU). (a) Query-level temporal aggregation over a window of radar and image features $\{R_{t-k}, I_{t-k}\}$, where the GRU is shared across frames and query tokens to update the hidden state $h_{m,k}$. (b) Per-query GRU cell detail: ego-motion cues (translation/velocity/rotation) are embedded and linearly projected to modulate the recurrent update (ConvGRU), enabling motion-aware temporal fusion of query representations.
  • Figure 5: Per-image object statistics on the training set. For each sampled image, we report the total number of labeled instances and the number of instances that satisfy a small-size criterion under the input resolution used for training.
  • ...and 5 more figures
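
The distance-decay prior described for SASA in Figure 3(b) can be pictured as a penalty on attention logits that grows with the distance between two radar points. The sketch below is one plausible form, assuming a linear penalty `decay * ||p_i - p_j||` subtracted from scaled dot-product scores; the paper's exact formulation is not given in this section, and the function name `distance_decay_attention` and the `decay` parameter are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def distance_decay_attention(coords, queries, keys, values, decay=1.0):
    """Single-head attention with a distance penalty on the logits.

    ASSUMPTION: logit_ij = (q_i . k_j) / sqrt(d) - decay * ||p_i - p_j||.
    This is an illustrative prior, not the authors' SASA definition.
    """
    d = len(queries[0])
    out, weights = [], []
    for p_i, q in zip(coords, queries):
        logits = []
        for p_j, k in zip(coords, keys):
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            logits.append(score - decay * math.dist(p_i, p_j))
        w = softmax(logits)                  # attention weights for point i
        weights.append(w)
        out.append([sum(w_j * v[c] for w_j, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out, weights


# Usage: two nearby points and one distant return with identical features.
coords = [(0.0, 0.0), (1.0, 0.0), (50.0, 0.0)]
feats = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
out, w = distance_decay_attention(coords, feats, feats, feats, decay=0.5)
```

Because the three points share identical features, any difference in the attention weights comes only from the distance term: the first point attends mostly to itself and its neighbor and almost ignores the return 50 units away. Setting `decay=0.0` recovers plain dot-product attention.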