Mask-RadarNet: Enhancing Transformer With Spatial-Temporal Semantic Context for Radar Object Detection in Autonomous Driving

Yuzhi Wu, Jun Liu, Guangfeng Jiang, Weijian Liu, Danilo Orlando

TL;DR

This work addresses radar-based object detection for autonomous driving by exploiting spatial-temporal semantic context often neglected in radar encoders. It introduces Mask-RadarNet, a 3D transformer that interleaves convolutions with self-attention, employs PatchShift for efficient temporal fusion, and incorporates a class masking attention module (CMAM) plus an auxiliary decoder to generate semantic priors. The architecture achieves state-of-the-art performance on the CRUW dataset while reducing computational cost and parameter count, largely due to the PatchShift design and semantic priors. The results demonstrate that integrating spatial-temporal semantic context into radar sequence encoding improves detection accuracy, including for small objects, with practical implications for robust autonomous driving systems.

Abstract

As a cost-effective and robust technology, automotive radar has improved steadily in recent years, making it an appealing complement to commonly used sensors such as cameras and LiDAR in autonomous driving. Radio frequency (RF) data, which carry rich semantic information, are attracting growing attention. Most current radar-based models take RF image sequences as input. However, these models rely heavily on convolutional neural networks and leave out the spatial-temporal semantic context during the encoding stage. To address these problems, we propose a model called Mask-RadarNet that fully utilizes the hierarchical semantic features of the input radar data. Mask-RadarNet interleaves convolution and attention operations in place of the traditional architecture of transformer-based models. In addition, patch shift is introduced into Mask-RadarNet for efficient spatial-temporal feature learning. By shifting a subset of patches along the temporal dimension in a specific mosaic pattern, Mask-RadarNet achieves competitive performance while reducing the computational burden of spatial-temporal modeling. To capture spatial-temporal semantic contextual information, we design the class masking attention module (CMAM) in our encoder. Moreover, a lightweight auxiliary decoder is added to aggregate the prior maps generated by the CMAM. Experiments on the CRUW dataset demonstrate the superiority of the proposed method over several state-of-the-art radar-based object detection algorithms. With relatively lower computational complexity and fewer parameters, the proposed Mask-RadarNet achieves higher recognition accuracy for object detection in autonomous driving.

Paper Structure

This paper contains 28 sections, 29 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Examples of RGB images and their corresponding RF images which represent the same scene. RF images are in range-azimuth coordinates.
  • Figure 2: Comparisons of Mask-RadarNet with other SOTA models on the CRUW dataset. Different models are represented by marks of different colors. Moreover, a smaller mark means a smaller model size.
  • Figure 3: Overview of proposed Mask-RadarNet. The encoder is in the middle, and the two decoders are on the left and right. The encoder is a hierarchical hybrid structure of convolution and self-attention mechanisms, which consists of the PatchShift 3D SwinTransformer module and the CMAM. The right decoder is the main decoder including the T-SwinTransformer module. The left decoder is the auxiliary decoder which generates the final prior maps. The orange lines represent the movement of query and key features from the PatchShift 3D SwinTransformer module to the main decoder. The blue lines represent the movement of query features from the CMAM to the auxiliary decoder.
  • Figure 4: An example of patch shift for three neighboring frames. The current frame $t$ aggregates information from neighboring frames $t-1$ and $t+1$.
  • Figure 5: Three typical shift patterns. Pattern A only shifts patches within 3 neighboring frames, while Pattern B has a temporal field of 4 and Pattern C has a temporal field of 9.
  • ...and 7 more figures
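
The patch shift idea described in the captions of Figures 4 and 5 can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `patch_shift`, the 2x2 offset grid `PATTERN`, and the circular wrap at the ends of the sequence are all assumptions made for illustration. The sketch only shows how tiling a small grid of temporal offsets over the patch grid lets frame $t$ pick up patches from frames $t-1$ and $t+1$ through plain indexing.

```python
def patch_shift(frames, pattern):
    """Shift patches along the temporal axis following a tiled offset pattern.

    frames:  list of T frames, each a list of H rows of W patch tokens.
    pattern: small 2D grid of integer temporal offsets (hypothetical layout),
             tiled over the H x W patch grid.
    """
    num_frames = len(frames)
    pat_h, pat_w = len(pattern), len(pattern[0])
    shifted = []
    for t in range(num_frames):
        new_frame = []
        for i, row in enumerate(frames[t]):
            new_row = []
            for j in range(len(row)):
                offset = pattern[i % pat_h][j % pat_w]
                # Circular wrap at sequence boundaries (an assumption here).
                src = (t + offset) % num_frames
                new_row.append(frames[src][i][j])
            new_frame.append(new_row)
        shifted.append(new_frame)
    return shifted


# Hypothetical pattern with offsets in {-1, 0, +1}, so each output frame
# mixes patches from 3 neighboring frames.
PATTERN = [[0, -1],
           [1, 0]]

# Three frames of a 2x2 patch grid; each token records (frame, row, col).
frames = [[[(t, i, j) for j in range(2)] for i in range(2)] for t in range(3)]
shifted = patch_shift(frames, PATTERN)

for row in shifted[1]:
    print(row)
```

In this toy example the middle frame keeps its own patches where the offset is 0 and takes one patch each from the previous and next frames. Because the shift is pure indexing, it adds no learnable parameters in this sketch; any attention applied afterwards within a single frame would then see tokens from several time steps.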