RETR: Multi-View Radar Detection Transformer for Indoor Perception

Ryoma Yataka, Adriano Cardace, Pu Perry Wang, Petros Boufounos, Ryuhei Takahashi

TL;DR

RETR addresses indoor perception with multi-view radar by adapting the DETR transformer framework to fuse horizontal and vertical radar heatmaps. It introduces depth-prioritized tunable positional encoding, a tri-plane set-prediction loss that jointly supervises radar and image planes, and a learnable radar-to-camera transformation constrained to the SO(3) group via a Lie-algebra-based reparameterization. The architecture uses Top-K feature selection to manage complexity, and a cross-view encoder plus decoder learns 3D spatial embeddings for objects, projecting 3D radar boxes into the image plane for detection and segmentation. Empirical results on MMVR and HIBER show substantial improvements over RFMask and DETR baselines, with notable gains from incorporating TPE and tri-plane supervision, validating RETR as a strong, end-to-end approach for indoor radar perception with practical inference times. The work also discusses limitations (e.g., arm-position accuracy and ghost targets) and broader implications for privacy-preserving yet potentially privacy-invasive indoor sensing.
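The SO(3) constraint mentioned above can be illustrated with the standard exponential map from the Lie algebra so(3) to the rotation group. The following is a minimal sketch in plain Python, assuming an axis-angle parameter vector `w`; the function name `so3_exp` is hypothetical, and the exact reparameterization used by RETR is not specified in this summary.

```python
import math

def so3_exp(w):
    """Map an axis-angle vector w in R^3 to a 3x3 rotation matrix (Rodrigues' formula)."""
    wx, wy, wz = w
    theta = math.sqrt(wx * wx + wy * wy + wz * wz)
    # Skew-symmetric matrix K such that K v equals the cross product w x v.
    K = [[0.0, -wz, wy],
         [wz, 0.0, -wx],
         [-wy, wx, 0.0]]
    K2 = [[sum(K[i][k] * K[k][j] for k in range(3)) for j in range(3)]
          for i in range(3)]
    if theta < 1e-8:
        a, b = 1.0, 0.5  # small-angle limits of the two coefficients
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / (theta * theta)
    # R = I + a K + b K^2
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]

# Illustrative input: a 90-degree rotation about the z-axis.
R = so3_exp([0.0, 0.0, math.pi / 2])
```

Because every unconstrained 3-vector maps to a valid rotation matrix, a gradient-based optimizer can update `w` freely without the result leaving SO(3), which is the usual motivation for this kind of reparameterization.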

Abstract

Indoor radar perception has seen rising interest due to affordable costs driven by emerging automotive imaging radar developments and the benefits of reduced privacy concerns and reliability under hazardous conditions (e.g., fire and smoke). However, existing radar perception pipelines fail to account for distinctive characteristics of the multi-view radar setting. In this paper, we propose Radar dEtection TRansformer (RETR), an extension of the popular DETR architecture, tailored for multi-view radar perception. RETR inherits the advantages of DETR, eliminating the need for hand-crafted components for object detection and segmentation in the image plane. More importantly, RETR incorporates carefully designed modifications such as 1) depth-prioritized feature similarity via a tunable positional encoding (TPE); 2) a tri-plane loss from both radar and camera coordinates; and 3) a learnable radar-to-camera transformation via reparameterization, to account for the unique multi-view radar setting. Evaluated on two indoor radar perception datasets, our approach outperforms existing state-of-the-art methods by a margin of 15.38+ AP for object detection and 11.91+ IoU for instance segmentation, respectively. Our implementation is available at https://github.com/merlresearch/radar-detection-transformer.
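The depth-prioritized similarity in item 1) can be illustrated with a toy positional encoding. This is only a sketch of the general idea, not the paper's formulation: the functions `sinusoid`, `tpe`, and `similarity`, the frequency ladder `1 / (i + 1)`, and the 12-to-4 split between depth and angular frequencies are all assumptions made for illustration.

```python
import math

def sinusoid(pos, n_freq):
    """Sinusoidal embedding of a scalar position (2 * n_freq dimensions)."""
    emb = []
    for i in range(n_freq):
        w = 1.0 / (i + 1)  # hypothetical frequency ladder
        emb += [math.sin(w * pos), math.cos(w * pos)]
    return emb

def tpe(depth, angle, n_depth, n_angle):
    """Concatenate depth and angular embeddings; n_depth and n_angle set the split."""
    return sinusoid(depth, n_depth) + sinusoid(angle, n_angle)

def similarity(q, k):
    """Dot-product similarity, as used inside attention before the softmax."""
    return sum(a * b for a, b in zip(q, k))

# Allocate three times as many frequencies to depth as to angle.
query = tpe(depth=5.0, angle=2.0, n_depth=12, n_angle=4)
same_depth = tpe(depth=5.0, angle=5.0, n_depth=12, n_angle=4)  # angle off by 3
same_angle = tpe(depth=8.0, angle=2.0, n_depth=12, n_angle=4)  # depth off by 3

score_same_depth = similarity(query, same_depth)
score_same_angle = similarity(query, same_angle)
```

In this toy setup each frequency contributes `cos(w * offset)` to the dot product, so a depth mismatch is penalized by all twelve depth frequencies while the same mismatch in angle is penalized by only four. As a result `score_same_depth` exceeds `score_same_angle`, and changing the split tunes how strongly depth agreement is favored.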

Paper Structure

This paper contains 64 sections, 22 equations, 15 figures, 12 tables.

Figures (15)

  • Figure 1: Taking horizontal-view and vertical-view radar heatmaps as inputs, RETR introduces a depth-prioritizing positional encoding (exploiting the depth shared between the two radar views) into the transformer self-attention and cross-attention modules. It outputs a set of 3D-embedding object queries that support image-plane object detection and segmentation via a calibrated or learnable radar-to-camera coordinate transformation and a 3D-to-2D pinhole camera projection.
  • Figure 2: Indoor radar perception pipeline: (a) multiple radar views are used to estimate 3D BBoxes in the radar coordinate system; (b) the 3D BBoxes are then transformed into the 3D camera coordinate system by a radar-to-camera transformation; and (c) the transformed 3D BBoxes are projected onto the image plane for final object detection. The blue line denotes a fixed-height region proposal in RFMask, while the magenta line denotes an object query with learnable height in RETR.
  • Figure 3: The RETR architecture: 1) Encoder: Top-$K$ feature selection and tunable positional encoding (TPE) assist feature association across the two radar views; 2) Decoder: TPE is also used to assist the association between object queries and multi-view radar features; 3) 3D BBox Head: object queries are trained to estimate 3D objects in the radar coordinate system, which are projected onto three planes for supervision via a coordinate transformation; 4) Segmentation Head: the same queries are used to predict binary pixels within each predicted BBox in the image plane.
  • Figure 4: Schemes of positional encoding: (a) the sum operation in the original DETR; (b) the concatenation in Conditional DETR; and (c) TPE in RETR that allows for adjustable dimensions between depth and angular embeddings and promotes higher similarity scores for keys and queries with similar depth embeddings than those far apart in depth.
  • Figure 5: Tri-Plane BBox loss.
  • ...and 10 more figures
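
The three-step pipeline in the Figure 2 caption (3D box in radar coordinates, radar-to-camera transformation, pinhole projection) can be sketched as follows. This is a minimal illustration under stated assumptions: the function `project_box`, the axis convention (camera z-axis as depth), and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and not taken from the paper or its code.

```python
from itertools import product

def project_box(center, size, R, t, fx, fy, cx, cy):
    """Project an axis-aligned 3D box in radar coordinates to a 2D image-plane box."""
    half = [s / 2.0 for s in size]
    us, vs = [], []
    for signs in product((-1.0, 1.0), repeat=3):
        # One of the eight box corners in radar coordinates.
        p = [center[i] + signs[i] * half[i] for i in range(3)]
        # Radar-to-camera rigid transformation: p_cam = R p + t.
        x, y, z = (sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))
        # Pinhole projection; assumes z > 0 (corner in front of the camera).
        us.append(fx * x / z + cx)
        vs.append(fy * y / z + cy)
    # Tightest 2D box enclosing all projected corners: (u_min, v_min, u_max, v_max).
    return min(us), min(vs), max(us), max(vs)

# Illustrative values: identity extrinsics and a 2 m cube centered 4 m away.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bbox = project_box(center=(0.0, 0.0, 4.0), size=(2.0, 2.0, 2.0),
                   R=identity, t=(0.0, 0.0, 0.0),
                   fx=100.0, fy=100.0, cx=320.0, cy=240.0)
```

Taking the minimum and maximum over the projected corners is one simple way to obtain an image-plane box from a 3D box. With a learnable transformation, `R` and `t` would be trained parameters rather than fixed calibration values.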