SIRA: Scalable Inter-frame Relation and Association for Radar Perception

Ryoma Yataka, Pu Perry Wang, Petros Boufounos, Ryuhei Takahashi

TL;DR

This paper introduces two components. Extended temporal relation generalizes the existing temporal relation layer from two consecutive frames to multiple inter-frames, using temporally regrouped window attention for scalability. Motion consistency track uses a pseudo-tracklet generated from observational data for better trajectory prediction and subsequent object association.

Abstract

Conventional radar feature extraction is limited by low spatial resolution, noise, multipath reflection, ghost targets, and motion blur. These limitations can be exacerbated by nonlinear object motion, particularly from an ego-centric viewpoint. Addressing these challenges requires exploiting temporal feature relations over an extended horizon and enforcing spatial motion consistency for effective association. To this end, this paper proposes SIRA (Scalable Inter-frame Relation and Association) with two designs. First, inspired by Swin Transformer, we introduce extended temporal relation, which generalizes the existing temporal relation layer from two consecutive frames to multiple inter-frames with temporally regrouped window attention for scalability. Second, we propose motion consistency track, which uses a pseudo-tracklet generated from observational data for better trajectory prediction and subsequent object association. Our approach achieves 58.11 mAP@0.5 for oriented object detection and 47.79 MOTA for multiple object tracking on the Radiate dataset, surpassing the previous state of the art by +4.11 mAP@0.5 and +9.94 MOTA, respectively.
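The idea of regrouping patches from several frames into shared attention windows can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names `regroup_windows` and `window_self_attention` are hypothetical, the regrouping here simply stacks the patch at the same spatial location from every frame (the paper describes a deformable temporal order), and the attention is plain unmasked self-attention without learned projections (the paper uses masked multi-head cross-attention).

```python
import numpy as np


def regroup_windows(frames: np.ndarray, patch: int) -> np.ndarray:
    """Regroup per-frame patches into cross-frame windows (illustrative only).

    frames: array of shape (T, H, W, C); H and W must be divisible by `patch`.
    Returns an array of shape (num_windows, T * patch * patch, C), where each
    window gathers the patch at one spatial location from all T frames.
    """
    T, H, W, C = frames.shape
    if H % patch or W % patch:
        raise ValueError("H and W must be divisible by the patch size")
    gh, gw = H // patch, W // patch
    # Split the spatial axes into (grid index, offset inside the patch).
    x = frames.reshape(T, gh, patch, gw, patch, C)
    # Bring the grid indices to the front so time sits inside each window.
    x = x.transpose(1, 3, 0, 2, 4, 5)  # (gh, gw, T, patch, patch, C)
    return x.reshape(gh * gw, T * patch * patch, C)


def window_self_attention(windows: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention inside each window.

    Assumption: no learned query/key/value projections and no masking.
    """
    d = windows.shape[-1]
    scores = windows @ windows.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ windows


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(3, 4, 4, 2))  # T=3 frames, 4x4 grid, 2 channels
    windows = regroup_windows(frames, patch=2)
    print(windows.shape)                         # (4, 12, 2)
    print(window_self_attention(windows).shape)  # (4, 12, 2)
```

In this toy setting, attention over all 3 × 4 × 4 = 48 tokens at once would compare 48² = 2304 token pairs, whereas the four regrouped windows of 12 tokens each compare 4 × 12² = 576 pairs. This is the kind of saving that window-based attention is meant to provide as the number of frames grows.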

Paper Structure

This paper contains 69 sections, 40 equations, 21 figures, 9 tables, 1 algorithm.

Figures (21)

  • Figure 1: Conventional radar perception pipelines such as TempoRadar [Li2022_TemporalRelations] (Bottom Row) rely on only one or two frames. This limited time horizon may lead to incorrect feature-level and object-level association (e.g., at $t=T-1$), and the errors may propagate to subsequent frames (e.g., $t=T$). In contrast, SIRA (Top Row) accounts for joint spatio-temporal consistency over an extended temporal horizon (e.g., all $3$ frames here), allowing for more accurate association in nonlinear motion scenarios even from an ego-centric viewpoint.
  • Figure 2: The architecture of SIRA with two modules: 1) extended temporal relation (ETR) capturing the temporal feature consistency while maintaining computational efficiency, and 2) motion consistency track (MCTrack) estimating pseudo-direction of objects during training and establishing pseudo-tracklets for better association in inference. The detection loss $\mathcal{L}_{t}^{\text{BBox}}$ and pseudo-direction loss $\mathcal{L}^{\text{DEst}}$ are used to train the pipeline end-to-end for object detection and tracking.
  • Figure 3: The TRWA block of the ETR module. Each frame is partitioned into sub-frame patches (in two contrasting colors of each frame in Top Left) and these patches are regrouped into new windows (Top Right) in a deformable temporal order (arrow lines). Masked multi-head cross-attention (MCA) is applied to new regrouped windows for scalable cross-window attention.
  • Figure 4: Direction Estimation (DEst) decoder head. Each DEst head takes a pair of $2$ frames $\mathbf{Z}_T$ and $\mathbf{Z}_{T-\tau}$, and estimates the pseudo-direction $\widehat{\mathbf{d}}_{T\mid T-\tau}$ (arrow lines in red).
  • Figure 5: The calculation of similarity metrics $C^{\text{angle}}$ and $C^{\text{tracklet}}$ in MCTrack at inference. A pseudo-tracklet $\{\left\{{\widehat{\mathbf{z}}_t}\right\}_{t=1}^{T}, \left\{{\widehat{\mathbf{v}}_t}\right\}_{t=2}^{T}\}$ is constructed with $\widehat{\mathbf{d}}_{T|T-\tau}$ estimated with DEst, and is used for association: (Top) rotating a state $\mathbf{x}_{T|T-1}$ so that it is more correlated with the observation $\mathbf{z}_T$, (Bottom) directly correlating the observations $\mathbf{z}_t$ with $\widehat{\mathbf{z}}_t$.
  • ...and 16 more figures
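
The Figure 5 caption names two similarity metrics, $C^{\text{angle}}$ and $C^{\text{tracklet}}$, without giving their formulas here. As a rough illustration of what direction-based and position-based association scores can look like, the sketch below uses a cosine similarity between two 2-D vectors and a negative mean Euclidean distance between matched positions. Both definitions, and the function names `angle_similarity` and `tracklet_similarity`, are assumptions made for illustration and should not be read as the paper's definitions.

```python
import math


def angle_similarity(motion, pseudo_direction):
    """Cosine of the angle between two 2-D vectors (assumed stand-in metric).

    Returns 0.0 when either vector has zero length.
    """
    (mx, my), (dx, dy) = motion, pseudo_direction
    norm = math.hypot(mx, my) * math.hypot(dx, dy)
    if norm == 0.0:
        return 0.0
    return (mx * dx + my * dy) / norm


def tracklet_similarity(observations, pseudo_positions):
    """Negative mean Euclidean distance between matched 2-D positions.

    Assumed stand-in metric: higher (closer to zero) means the observed
    positions follow the pseudo-tracklet positions more closely.
    """
    if not observations or len(observations) != len(pseudo_positions):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(math.dist(z, z_hat)
                for z, z_hat in zip(observations, pseudo_positions))
    return -total / len(observations)


if __name__ == "__main__":
    # Same heading gives 1.0, perpendicular gives 0.0, opposite gives -1.0.
    print(angle_similarity((1.0, 0.0), (2.0, 0.0)))
    print(angle_similarity((1.0, 0.0), (0.0, 3.0)))
    print(angle_similarity((1.0, 0.0), (-1.0, 0.0)))
    # One observation that is 5 units from its pseudo-position gives -5.0.
    print(tracklet_similarity([(0.0, 0.0)], [(3.0, 4.0)]))
```

A tracker could combine scores like these into a cost matrix for assignment between existing tracks and new detections; how SIRA actually weights and combines its metrics is not stated in the captions above.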