Human Detection from 4D Radar Data in Low-Visibility Field Conditions

Mikael Skog, Oleksandr Kotlyar, Vladimír Kubelka, Martin Magnusson

TL;DR

This work tackles robust human detection in low-visibility environments by leveraging 4D radar data. It introduces TMVA4D, a temporal multi-view CNN that operates on five radar heatmaps (EA, ER, ED, RA, DA) to perform semantic segmentation of pedestrians versus background, trained on a novel mining/industrial dataset with ground truth derived from thermal imagery. The approach achieves an overall mean IoU of 78.28% and mean Dice of 86.1% on the two-class task, demonstrating strong background suppression and practical viability under dust, smoke, and other particulates. By exploiting the elevation dimension and temporal context, TMVA4D provides a radar-based perception solution with potential impact for safer autonomous operation in harsh mining and industrial settings.

Abstract

Autonomous driving technology is increasingly being used on public roads and in industrial settings such as mines. While it is essential to detect pedestrians, vehicles, or other obstacles, adverse field conditions negatively affect the performance of classical sensors such as cameras or lidars. Radar, on the other hand, is a promising modality that is less affected by, e.g., dust, smoke, water mist or fog. In particular, modern 4D imaging radars provide target responses across the range, vertical angle, horizontal angle and Doppler velocity dimensions. We propose TMVA4D, a CNN architecture that leverages this 4D radar modality for semantic segmentation. The CNN is trained to distinguish between the background and person classes based on a series of 2D projections of the 4D radar data that include the elevation, azimuth, range, and Doppler velocity dimensions. We also outline the process of compiling a novel dataset consisting of data collected in industrial settings with a car-mounted 4D radar and describe how the ground-truth labels were generated from reference thermal images. Using TMVA4D on this dataset, we achieve an mIoU score of 78.2% and an mDice score of 86.1%, evaluated on the two classes background and person.
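
To make the two reported metrics concrete, the following is a minimal sketch of how mean IoU and mean Dice can be computed for a two-class segmentation mask. It is an illustration under assumptions, not the authors' evaluation code: the function names are hypothetical, masks are assumed to be flattened lists of class indices (0 for background, 1 for person), and the averaging is done over classes for a single pair of masks, whereas the paper may aggregate over frames differently.

```python
def confusion_counts(pred, target, cls):
    """Count true positives, false positives and false negatives for one class."""
    tp = fp = fn = 0
    for p, t in zip(pred, target):
        if p == cls and t == cls:
            tp += 1
        elif p == cls:
            fp += 1
        elif t == cls:
            fn += 1
    return tp, fp, fn


def mean_iou_and_dice(pred, target, num_classes=2):
    """Average per-class IoU and Dice over the classes present in pred or target.

    pred and target are flattened masks of class indices (assumed layout).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious, dices = [], []
    for cls in range(num_classes):
        tp, fp, fn = confusion_counts(pred, target, cls)
        if tp + fp + fn == 0:
            continue  # class absent from both masks: skip it in the average
        ious.append(tp / (tp + fp + fn))             # IoU  = TP / (TP + FP + FN)
        dices.append(2 * tp / (2 * tp + fp + fn))    # Dice = 2TP / (2TP + FP + FN)
    if not ious:
        raise ValueError("no class is present in either mask")
    return sum(ious) / len(ious), sum(dices) / len(dices)


# Toy masks: 0 = background, 1 = person
miou, mdice = mean_iou_and_dice([0, 0, 1, 1], [0, 1, 1, 1])
```

For a single class, Dice is always at least as large as IoU (Dice = 2·IoU / (1 + IoU)), which is consistent with the mDice score being higher than the mIoU score in the abstract.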

Paper Structure (11 sections, 7 equations, 5 figures, 2 tables)

This paper contains 11 sections, 7 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Predicted mask in the camera (elevation-azimuth) view, produced by our architecture TMVA4D (class background in black and class person in red). The data used to make the prediction are shown here in their original form as a point cloud, with three bounding boxes, each drawn around the points corresponding to a distinct person. (A fourth person is standing close to a wall and cannot be recognized in any figure.) A "wall" of sprayed water blocks the view of lidars and cameras (see the front color and front thermal camera subfigures) without obstructing the 4D radar's view.
  • Figure 2: Point cloud projected to the EA view with corresponding EA heatmap. The projection subfigure is a graphical representation of the points of a point cloud projected to the EA view, overlaid on the point cloud's temporally closest thermal image. The heatmap subfigure shows an EA heatmap generated from the point cloud. Brighter colors indicate higher values in the EA heatmap matrix (viridis colormap).
  • Figure 3: Overview of the proposed TMVA4D architecture. The architecture takes as input data in the EA, ER, ED, RA, and DA views to output predictions in the annotated EA view. Heatmaps in the input views at time $t$ and of the $q$ previous frames are used to output predictions for the frame at time $t$. These predictions are segmentation masks of the output view, with black here representing class background and red representing class person.
  • Figure 4: TMVA4D components. Each encoder takes a single-channel input consisting of the heatmaps at time $t$ and of the previous $q$ frames, stacked depthwise. The encoder input height and width are those of the view. Each 3D convolution reduces the depth of the feature maps by 2. The first pooling layer is absent in the EA encoder. Each ASPP module performs parallel convolutions at different dilation (dil) rates. $K$ represents the number of classes, here 2.
  • Figure 5: Comparison between ground truth and TMVA4D predictions on the test set. In each subfigure: predictions for a particular frame (right) and the image used to produce the ground-truth annotations (left). Class background is shown in black and class person is shown in red.
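
The captions for Figures 2 to 4 describe three steps: projecting a point cloud to an EA heatmap, stacking the heatmaps of time $t$ and the $q$ previous frames depthwise, and losing 2 depth slices per 3D convolution. The sketch below illustrates these steps in plain Python. Everything specific in it is an assumption made for illustration and is not taken from the paper: the function names, the axis convention (x forward, y left, z up), the field-of-view limits, the bin counts, and the choice to accumulate a per-point power value in each bin.

```python
import math


def ea_heatmap(points, n_elev=4, n_azim=8,
               elev_fov=(-0.3, 0.3), azim_fov=(-1.0, 1.0)):
    """Accumulate point power into an elevation-azimuth grid.

    points: iterable of (x, y, z, power) tuples (assumed format).
    Row 0 is the lowest elevation bin, column 0 the lowest azimuth bin.
    """
    heat = [[0.0] * n_azim for _ in range(n_elev)]
    for x, y, z, power in points:
        rng = math.sqrt(x * x + y * y + z * z)
        if rng == 0.0:
            continue  # angles are undefined at the sensor origin
        azim = math.atan2(y, x)
        elev = math.asin(z / rng)
        if not (elev_fov[0] <= elev < elev_fov[1]):
            continue  # outside the assumed vertical field of view
        if not (azim_fov[0] <= azim < azim_fov[1]):
            continue  # outside the assumed horizontal field of view
        row = int((elev - elev_fov[0]) / (elev_fov[1] - elev_fov[0]) * n_elev)
        col = int((azim - azim_fov[0]) / (azim_fov[1] - azim_fov[0]) * n_azim)
        heat[row][col] += power
    return heat


def stack_frames(heatmaps, t, q):
    """Return the heatmaps of frames t-q .. t, ordered oldest to newest."""
    if t < q or t >= len(heatmaps):
        raise ValueError("need q previous frames and a valid frame index t")
    return heatmaps[t - q:t + 1]


def depth_after_convs(depth, n_convs):
    """Depth left after n_convs 3D convolutions that each remove 2 slices."""
    remaining = depth - 2 * n_convs
    if remaining < 1:
        raise ValueError("input depth too small for this many convolutions")
    return remaining


# Two returns from the same direction and one behind the sensor (ignored)
frame = ea_heatmap([(1.0, 0.1, 0.05, 2.0), (2.0, 0.2, 0.1, 1.5), (-1.0, 0.0, 0.0, 9.0)])
stack = stack_frames([frame] * 5, t=4, q=2)   # 3 frames stacked depthwise
```

Under these assumptions, a stack of $q + 1$ frames has depth $q + 1$, so `depth_after_convs(q + 1, n)` shows how many temporal slices survive $n$ such convolutions; the number of 3D convolutions per encoder is not stated in the captions above.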