CalibFusion: Transformer-Based Differentiable Calibration for Radar-Camera Fusion Detection in Water-Surface Environments

Yuting Wan, Liguo Sun, Jiuwu Hao, Pin LV

Abstract

Millimeter-wave (mmWave) Radar-Camera fusion improves perception under adverse illumination and weather, but its performance is sensitive to Radar-Camera extrinsic calibration: residual misalignment biases Radar-to-image projection and degrades cross-modal aggregation for downstream 2D detection. Existing calibration and auto-calibration methods are mainly developed for road and urban scenes with abundant structures and object constraints, whereas water-surface environments feature large textureless regions, sparse and intermittent targets, and wave-/specular-induced Radar clutter, which weakens explicit object-centric matching. We propose CalibFusion, a calibration-conditioned Radar-Camera fusion detector that learns implicit extrinsic refinement end-to-end with the detection objective. CalibFusion builds a multi-frame persistence-aware Radar density representation with intensity weighting and Doppler-guided suppression of fast-varying clutter. A cross-modal transformer interaction module predicts a confidence-gated refinement of the initial extrinsics, which is integrated through a differentiable projection-and-splatting operator to generate calibration-conditioned image-plane Radar features. Experiments on WaterScenes and FLOW show improved fusion-based 2D detection and robustness under synthetic miscalibration, supported by sensitivity analyses and qualitative Radar-to-image overlays. Results on nuScenes indicate that the refinement mechanism transfers beyond water-surface scenarios.
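
To make the sensitivity claim concrete, the following is a minimal sketch of how a small extrinsic error shifts a projected Radar point in the image. It assumes a standard pinhole camera model and is not the paper's implementation: the function names (`rot_y`, `project`, `pixel_error`), the intrinsics, and the 1-degree perturbation are all illustrative assumptions.

```python
import math

def rot_y(angle_rad):
    """3x3 rotation about the camera y-axis (hypothetical perturbation axis)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def project(point, R, t, fx, fy, cx, cy):
    """Map a 3D Radar point to pixel coordinates with extrinsics (R, t).

    Assumes a pinhole camera with z pointing forward. Returns None when
    the transformed point lies behind the image plane.
    """
    x, y, z = (sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def pixel_error(point, R_ref, t_ref, R_init, t_init, intrinsics):
    """Pixel distance between projections under reference and initial extrinsics."""
    uv_ref = project(point, R_ref, t_ref, *intrinsics)
    uv_init = project(point, R_init, t_init, *intrinsics)
    if uv_ref is None or uv_init is None:
        return None
    return math.hypot(uv_ref[0] - uv_init[0], uv_ref[1] - uv_init[1])

# Illustrative values only: a target 20 m ahead, a 1000 px focal length,
# and a 1-degree rotation error in the initial extrinsics.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
intrinsics = (1000.0, 1000.0, 640.0, 360.0)  # fx, fy, cx, cy
error_px = pixel_error(
    (0.0, 0.0, 20.0),
    identity, [0.0, 0.0, 0.0],
    rot_y(math.radians(1.0)), [0.0, 0.0, 0.0],
    intrinsics,
)
print(f"pixel shift from a 1-degree rotation error: {error_px:.2f}")
```

Under this model a pure rotation error displaces the projection by roughly the focal length times the tangent of the angle, independent of range, whereas a translation error produces a shift that shrinks with distance.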

Paper Structure (42 sections, 34 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Overall architecture of CalibFusion. CalibFusion includes: (1) Doppler-Guided Persistence Density for multi-frame Radar density construction; (2) Cross-Modal Token Interaction between image and Radar tokens; (3) Confidence-Gated Extrinsic Refinement that updates the initial extrinsic $T_0$; and (4) Calibration-Conditioned Projection that generates image-plane Radar features for fusion-based 2D detection. The model is trained end-to-end with detection supervision backpropagated through the differentiable projection.
  • Figure 2: Cross-Modal Token Interaction via bi-directional cross-attention. (a) Radar-to-image attention updates Radar tokens using Radar queries and image keys/values. (b) Image-to-Radar attention updates image tokens using image queries and Radar keys/values. Layer normalization and residual updates are applied around each attention block.
  • Figure 3: Calibration-conditioned projection results. (a,d) Miscalibrated Radar-to-image projections under $T_0'$. (b,e) Refined projections after CalibFusion predicts $T_t$, improving cross-modal alignment. (c,f) Corresponding multi-frame Radar density maps used for refinement.
  • Figure 4: Calibration error statistics on the WaterScenes test set under a broad range of initial perturbations: (a) distribution of translation errors; (b) distribution of rotation errors.
  • Figure 5: Refinement error under axis-wise injected extrinsic offsets. Top row: translation error (m); bottom row: rotation error (deg). In each subplot, the solid curve corresponds to the perturbed axis (X or Y), while the dashed curve (“Compared”) reports the error on the other axis to indicate cross-axis coupling. (a,c) Y-axis offsets; (b,d) X-axis offsets.
  • ...and 1 more figures
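
The bi-directional cross-attention described in the Figure 2 caption can be sketched as follows. This is a simplified illustration rather than the authors' module: it uses a single head with no learned query/key/value projections, applies layer normalization before attention (the caption does not say whether normalization comes before or after), and computes both directions from the same input tokens. All function names are hypothetical.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token to zero mean and unit variance (no learned scale)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    """Scaled dot-product attention with keys = values = context tokens."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)]
        )
    return attended

def cross_block(tokens, context):
    """Residual update: tokens + Attn(LN(tokens), LN(context)) (pre-norm assumed)."""
    attended = cross_attend(
        [layer_norm(t) for t in tokens],
        [layer_norm(c) for c in context],
    )
    return [[t_i + a_i for t_i, a_i in zip(t, a)] for t, a in zip(tokens, attended)]

def bidirectional_interaction(radar_tokens, image_tokens):
    """(a) Radar queries attend to image tokens; (b) image queries attend to Radar tokens."""
    new_radar = cross_block(radar_tokens, image_tokens)
    new_image = cross_block(image_tokens, radar_tokens)
    return new_radar, new_image

# Toy tokens: 2 Radar tokens and 3 image tokens, each of dimension 4.
radar = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
image = [[0.5, 0.1, 0.2, 0.9], [1.0, 0.0, 1.0, 0.0], [2.0, 2.0, 1.0, 3.0]]
radar_out, image_out = bidirectional_interaction(radar, image)
```

Each direction preserves the number and dimension of the tokens it updates, so the two token sets can differ in length while still exchanging information.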