REXO: Indoor Multi-View Radar Object Detection via 3D Bounding Box Diffusion

Ryoma Yataka, Pu Perry Wang, Petros Boufounos, Ryuhei Takahashi

TL;DR

REXO tackles indoor multi-view radar object detection by diffusing 3D bounding boxes directly in radar space and conditioning the denoising on explicit cross-view features from horizontal and vertical heatmaps. A ground-level constraint reduces the number of diffusion parameters and enforces floor contact, while a cross-view denoising detector recovers $x_0$, with a learnable 3D-to-2D refinement that yields tighter image-plane boxes. Evaluations on MMVR and HIBER show gains over RFMask, DETR, and RETR of +11.02 AP on MMVR and +4.22 AP on HIBER, with strong generalization to unseen environments. The approach remains robust when the number of bounding boxes varies at inference time, and the number of diffusion steps can be tuned to trade runtime against accuracy, which suits indoor radar perception in poorly lit and privacy-sensitive settings.

Abstract

Multi-view indoor radar perception has drawn attention due to its cost-effectiveness and low privacy risks. Existing methods often rely on implicit cross-view radar feature association, such as proposal pairing in RFMask or query-to-feature cross-attention in RETR, which can lead to ambiguous feature matches and degraded detection in complex indoor scenes. To address these limitations, we propose REXO (multi-view Radar object dEtection with 3D bounding boX diffusiOn), which lifts the 2D bounding box (BBox) diffusion process of DiffusionDet into the 3D radar space. REXO uses the noisy 3D BBoxes to guide an explicit cross-view radar feature association, enhancing the cross-view radar-conditioned denoising process. Using the prior knowledge that a person is in contact with the ground, REXO reduces the number of diffusion parameters by determining some of them from this prior. Evaluated on two open indoor radar datasets, our approach surpasses state-of-the-art methods by a margin of +4.22 AP on the HIBER dataset and +11.02 AP on the MMVR dataset.
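
To make the diffusion and grounding ideas in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard DDPM forward process $q(x_t \mid x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I)$ with a cosine schedule, and a hypothetical box parameterization in which the second returned coordinate is vertical. The names `alpha_bar`, `diffuse_box`, and `ground_box`, the parameter ordering, and the choice of vertical axis are all assumptions made for illustration.

```python
import math
import random


def alpha_bar(t, num_steps=1000, s=0.008):
    """Cumulative signal level at step t under a cosine schedule (assumed choice)."""
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def diffuse_box(x0, t, rng, num_steps=1000):
    """Sample x_t ~ q(x_t | x_0) for the free (diffused) box parameters."""
    a = alpha_bar(t, num_steps)
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x0]


def ground_box(free_params, ground_level=0.0):
    """Hypothetical grounding step.

    free_params = (cx, cz, w, d, h): horizontal-plane center, width, depth, height.
    The vertical center cy is NOT diffused; it is derived so that the bottom
    face of the box sits on the ground plane.
    Returns the full 6-parameter box (cx, cy, cz, w, h, d).
    """
    cx, cz, w, d, h = free_params
    h = abs(h)  # noise can push the height negative; keep the box well-formed
    cy = ground_level + h / 2.0
    return (cx, cy, cz, w, h, d)


if __name__ == "__main__":
    rng = random.Random(0)
    x0 = [0.5, 0.4, 0.2, 0.2, 0.6]      # five free parameters instead of six
    xt = diffuse_box(x0, t=300, rng=rng)
    print(ground_box(xt))
```

Under these assumptions, five values are diffused per box instead of six, and every grounded box touches the floor by construction regardless of the noise level.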

Paper Structure

This paper contains 55 sections, 45 equations, 18 figures, 13 tables.

Figures (18)

  • Figure 1: (a) RFMask [Wu2023_RFMask] generates horizontal-view proposals with fixed-height vertical boxes; (b) RETR [Yataka2024_retr] implicitly links queries to cross-view features via decoder cross-attention; (c) DiffusionDet [Chen2023_diffusiondet] adapted to horizontal radar allows 2D denoising but needs extra pairing with fixed-height vertical boxes; (d) REXO (ours) performs diffusion directly in 3D radar space for simple, explicit cross-view association.
  • Figure 2: Generation of multi-view heatmaps from raw data.
  • Figure 3: REXO: 1) 3D BBox diffusion process in the radar space; 2) Geometric transformation and 3D-to-2D projection onto the image plane for geometry-aware supervision.
  • Figure 3: The ground-level constraint can improve the detection performance on both datasets.
  • Figure 4: REXO training: 1) A shared backbone extracts horizontal/vertical radar features $\{\boldsymbol{Z}_{\mathtt{hor}},\boldsymbol{Z}_{\mathtt{ver}}\}$; 2) Ground‑truth 3D BBoxes $\boldsymbol{x}_0$ are diffused to noisy $\boldsymbol{x}_t$; 3) $\boldsymbol{x}_t$ is grounded using a ground-level constraint; 4) $\mathtt{DenoisingDet}_{\theta}$ projects $\boldsymbol{x}_t$ onto both views and uses the aligned features to recover $\hat{\boldsymbol{x}}_0$; 5) A radar‑to‑camera transform and 3D-to-2D projection yield image BBoxes $\hat{\boldsymbol{b}}_{\mathtt{image}}$, enabling geometry‑aware supervision in radar space and image plane.
  • ...and 13 more figures
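
Step 5 of the Figure 4 caption (a radar-to-camera transform followed by a 3D-to-2D projection) can be illustrated with a purely geometric sketch. This is not the paper's learnable refinement: it assumes an axis-aligned 3D box, a known rigid transform `R`, `t`, and an ideal pinhole camera with intrinsics `fx`, `fy`, `u0`, `v0`. All function and parameter names here are hypothetical.

```python
from itertools import product


def box_corners(center, size):
    """Eight corners of an axis-aligned 3D box given its center and (w, h, d) size."""
    cx, cy, cz = center
    w, h, d = size
    return [(cx + sx * w / 2.0, cy + sy * h / 2.0, cz + sz * d / 2.0)
            for sx, sy, sz in product((-1, 1), repeat=3)]


def radar_to_camera(p, R, t):
    """Apply the rigid transform p_cam = R @ p_radar + t (R is a 3x3 nested list)."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def project_box(center, size, R, t, fx, fy, u0, v0):
    """Project all corners with a pinhole model and return the enclosing 2D box."""
    us, vs = [], []
    for p in box_corners(center, size):
        x, y, z = radar_to_camera(p, R, t)
        if z <= 0:
            raise ValueError("box corner lies behind the camera")
        us.append(fx * x / z + u0)
        vs.append(fy * y / z + v0)
    return (min(us), min(vs), max(us), max(vs))  # (u_min, v_min, u_max, v_max)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(project_box((0.0, 0.0, 4.0), (2.0, 2.0, 2.0),
                      identity, (0.0, 0.0, 0.0), 100.0, 100.0, 0.0, 0.0))
```

Taking the min and max over the projected corners gives the tightest axis-aligned image box that encloses the 3D box under this camera model. It is usually looser than a box drawn around the person, which is one plausible motivation for refining the projected boxes.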