A Resource Efficient Fusion Network for Object Detection in Bird's-Eye View using Camera and Raw Radar Data

Kavin Chandrasekaran, Sorin Grigorescu, Gijs Dubbelman, Pavol Jancura

TL;DR

This work feeds the raw range-Doppler spectrum of radar data directly to the network, avoiding conventional radar signal processing, and fuses the resulting features with independently processed camera images in Bird's-Eye View. The fusion strategy is compared with existing methods on the RADIal dataset in terms of both accuracy and computational complexity.

Abstract

Cameras can be used to perceive the environment around the vehicle, while affordable radar sensors are popular in autonomous driving systems because, unlike cameras, they can withstand adverse weather conditions. However, radar point clouds are sparse, have low azimuth and elevation resolution, and lack semantic and structural information about the scene, which generally lowers radar detection performance. In this work, we directly use the raw range-Doppler (RD) spectrum of radar data, thus avoiding radar signal processing. We independently process camera images within the proposed image processing pipeline. Specifically, we first transform the camera images to the Bird's-Eye View (BEV) polar domain and extract the corresponding features with our camera encoder-decoder architecture. The resulting feature maps are fused with Range-Azimuth (RA) features, which the radar decoder recovers from the RD spectrum input, to perform object detection. We compare our fusion strategy with existing methods on the RADIal dataset, not only in terms of accuracy but also on computational complexity metrics.

Paper Structure

This paper contains 19 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 2: Architecture Overview: The image processing pipeline first transforms the camera image into Bird's-Eye View (BEV). Subsequently, the resultant BEV undergoes conversion into polar representation, directly mapping to the Range-Azimuth (RA) image. Object detection is performed on RA image features fused with radar features from the radar decoder. The predictions obtained in the RA view are shown in the camera images with ground-truth bounding boxes in green and predictions in blue.
  • Figure 3: Image Processing Pipeline: The objects in the frame (four cars), marked in different colors, are shown in the BEV Cartesian and Polar pixel images. The origin is at the bottom center. The ground-truth polar coordinates, azimuth $(\theta)$ and range $(r)$, are marked for reference. $r$ denotes the distance from the objects to the ego vehicle (in meters); $\theta$ represents the angle at which the objects are located (in degrees).
  • Figure 4: The camera-only and radar-only encoders each contain four ResNet-50-like blocks with a pre-encoder block. The features from these blocks are named x0, x1, x2, x3, and x4. The thick blue curved arrow takes the encoder's output to the decoder's input in order to expand the input feature maps to higher resolutions. The dotted lines represent the skip connections used to preserve spatial information. The features from the camera-only decoder and radar-only decoder are then fused before being passed to the detection head. The head finally predicts the objects in the Bird's-Eye RA Polar View, as shown in the Architecture Overview figure.
  • Figure 5: Qualitative detection results from the proposed fusion model. The predictions obtained in the RA view (shown as blue boxes in the top row) are projected onto the camera images, with ground-truth bounding boxes in green.
  • Figure 6: The prediction in blue and the ground truth in green are shown in (a) front-view camera and (b) BEV Polar image. Zoom in to better visualize.
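
The BEV Cartesian-to-polar step described in the Figure 2 and Figure 3 captions can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the nearest-neighbor sampling, and all parameters (`meters_per_pixel`, `max_range_m`, `fov_deg`, and the bin counts) are assumptions made for illustration. The only details taken from the captions are that the origin sits at the bottom center of the BEV image, that range $r$ is in meters, and that azimuth $\theta$ is in degrees.

```python
import math


def cartesian_to_polar(x, y):
    """Return (range_m, azimuth_deg) for a BEV point.

    x is the lateral offset in meters (positive to the right) and y is the
    forward distance in meters. Azimuth is 0 degrees straight ahead.
    (This axis convention is an assumption for the sketch.)
    """
    r = math.hypot(x, y)
    theta = math.degrees(math.atan2(x, y))
    return r, theta


def bev_to_polar(bev, meters_per_pixel, num_range_bins, num_azimuth_bins,
                 max_range_m, fov_deg):
    """Resample a Cartesian BEV grid into a range-azimuth grid.

    bev is a list of rows; the ego vehicle is at the bottom-center pixel.
    Each polar cell takes the value of the nearest Cartesian pixel, and
    cells that fall outside the BEV image stay at 0.
    """
    height = len(bev)
    width = len(bev[0])
    origin_row = height - 1
    origin_col = (width - 1) / 2.0
    polar = [[0] * num_azimuth_bins for _ in range(num_range_bins)]
    for i in range(num_range_bins):
        # Center of the i-th range bin, in meters.
        r = (i + 0.5) * max_range_m / num_range_bins
        for j in range(num_azimuth_bins):
            # Center of the j-th azimuth bin, from -fov/2 to +fov/2.
            theta = math.radians(
                -fov_deg / 2.0 + (j + 0.5) * fov_deg / num_azimuth_bins)
            x = r * math.sin(theta)
            y = r * math.cos(theta)
            # Round to the nearest pixel; rows grow downward in the image.
            col = math.floor(origin_col + x / meters_per_pixel + 0.5)
            row = math.floor(origin_row - y / meters_per_pixel + 0.5)
            if 0 <= row < height and 0 <= col < width:
                polar[i][j] = bev[row][col]
    return polar
```

In this sketch, rows of the output index range and columns index azimuth, so an object straight ahead of the ego vehicle lands in the middle column at a row proportional to its distance. A practical pipeline would more likely use interpolation on tensors rather than nested Python loops.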