Fisheye Camera and Ultrasonic Sensor Fusion For Near-Field Obstacle Perception in Bird's-Eye-View

Arindam Das, Sudarshan Paul, Niko Scholz, Akhilesh Kumar Malviya, Ganesh Sistu, Ujjwal Bhattacharya, Ciarán Eising

TL;DR

This work tackles near-field obstacle perception by fusing a fisheye camera with an ultrasonic sensor in a bird's-eye-view (BEV) representation. It introduces an end-to-end CNN model with unimodal encoders, a BEV projection pipeline, and the CaMFuse fusion module that uses content-aware dilation to mitigate sensor misalignment, followed by a two-stage semantic occupancy decoder. A custom multisensor dataset with BEV ground truth is created, annotated, and analyzed, demonstrating that multimodal fusion outperforms unimodal baselines across indoor/outdoor scenes, obstacle distances, and motion conditions. The approach advances practical surround-view perception by leveraging complementary sensor strengths under challenging lighting and glare, with implications for safer, robust ADAS and autonomous driving systems.

Abstract

Accurate obstacle identification represents a fundamental challenge within the scope of near-field perception for autonomous driving. Fisheye cameras are frequently employed for comprehensive surround-view perception, including rear-view obstacle localization. However, the performance of such cameras can significantly deteriorate in low-light conditions, during nighttime, or when subjected to intense sun glare. Conversely, cost-effective sensors like ultrasonic sensors remain largely unaffected under these conditions. Therefore, we present, to our knowledge, the first end-to-end multimodal fusion model tailored for efficient obstacle perception in a bird's-eye-view (BEV) perspective, utilizing fisheye cameras and ultrasonic sensors. Initially, ResNeXt-50 is employed as a set of unimodal encoders to extract features specific to each modality. Subsequently, the feature space associated with the visible spectrum undergoes transformation into BEV. The fusion of these two modalities is facilitated via concatenation. At the same time, the ultrasonic spectrum-based unimodal feature maps pass through content-aware dilated convolution, applied to mitigate the misalignment between the two sensors in the fused feature space. Finally, the fused features are utilized by a two-stage semantic occupancy decoder to generate grid-wise predictions for precise obstacle perception. We conduct a systematic investigation to determine the optimal strategy for multimodal fusion of both sensors. We provide insights into our dataset creation procedures and annotation guidelines, and perform a thorough data analysis to ensure adequate coverage of all scenarios. When applied to our dataset, the experimental results underscore the robustness and effectiveness of our proposed multimodal fusion approach.
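
The fusion step described in the abstract (dilated convolution on the ultrasonic feature maps, then concatenation with the camera BEV features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract's dilation is content-aware, while the sketch below assumes a fixed dilation rate and a single shared 3x3 kernel, and the names `dilated_conv3x3` and `fuse_bev` are hypothetical.

```python
import numpy as np

def dilated_conv3x3(x, kernel, dilation):
    """3x3 cross-correlation on one (H, W) map with a fixed dilation and zero padding."""
    h, w = x.shape
    xp = np.pad(x, dilation)  # zero-pad so the output keeps the input size
    out = np.zeros((h, w), dtype=float)
    for i in range(3):
        for j in range(3):
            # Tap (i, j) sits (i - 1) * dilation rows and (j - 1) * dilation columns from the centre.
            dy, dx = i * dilation, j * dilation
            out += kernel[i, j] * xp[dy:dy + h, dx:dx + w]
    return out

def fuse_bev(camera_bev, ultrasonic_bev, kernel, dilation=2):
    """Widen each ultrasonic channel's receptive field, then concatenate along channels.

    ASSUMPTION: fixed dilation and a shared kernel stand in for the paper's
    content-aware dilated convolution, whose details are not given here.
    camera_bev: (C1, H, W), ultrasonic_bev: (C2, H, W) -> (C1 + C2, H, W).
    """
    widened = np.stack([dilated_conv3x3(ch, kernel, dilation) for ch in ultrasonic_bev])
    return np.concatenate([camera_bev, widened], axis=0)

# Usage: one ultrasonic response at cell (4, 4) spreads to cells two steps away.
camera = np.zeros((4, 8, 8))
ultrasonic = np.zeros((2, 8, 8))
ultrasonic[0, 4, 4] = 1.0
fused = fuse_bev(camera, ultrasonic, np.ones((3, 3)), dilation=2)
```

The dilation lets a single ultrasonic response influence BEV cells several steps away, which is one intuitive way a small spatial offset between the two sensors' feature maps could be tolerated before concatenation.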

Paper Structure (29 sections, 12 equations, 12 figures, 6 tables)

Figures (12)

  • Figure 1: A standard autonomous driving pipeline [uricar2019yes].
  • Figure 2: (a) Mounting positions for the ultrasonic sensors and the fisheye camera. The twelve ultrasonic sensors are shown as grey boxes on the outline of the car, and the rear fisheye camera is shown as a pink dot at the rear of the car; neither symbol is to scale with the actual sensor. The coordinate system corresponds to ISO 8855. (b) Schematic of an ultrasonic grid map filling step for one exemplary grid cell and one exemplary signalway. The grid is not to scale. The signal is emitted by Sensor $S1$ and received by Sensor $S2$. To get the echo amplitude value at the highlighted grid cell, the distances $d1$ and $d2$ as well as the angles $\alpha1$ and $\alpha2$ used for amplitude attenuation are determined.
  • Figure 3: Various automotive sensors [eising2021near] used in a typical perception stack, in either unimodal or multimodal settings.
  • Figure 4: Different types of obstacles that commonly appear in the rear view. Top: vehicle, pedestrian, carton. Bottom: pillar, cycle, wooden box.
  • Figure 5: Estimated field of view of the ultrasonic sensor system. The actual field of view also depends on the object being observed. Echoes are strongest on the sensor axis and weaken for objects positioned off-axis. (Image courtesy of Nathaniel Arnest, DSW Kronach, nathaniel.arnest@valeo.com.)
  • ...and 7 more figures
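
The grid-filling geometry in the Figure 2(b) caption can be sketched as follows. The code computes the distances $d1$, $d2$ and the off-axis angles $\alpha1$, $\alpha2$ for one grid cell and one emitter/receiver pair. The caption does not state how these quantities enter the amplitude attenuation, so the cosine weighting below is purely a placeholder assumption, and the function names, sensor positions, and axis angles are hypothetical.

```python
import math

def cell_geometry(cell, sensor_pos, sensor_axis_deg):
    """Return (distance, off-axis angle in radians) from a sensor to a grid-cell centre."""
    dx = cell[0] - sensor_pos[0]
    dy = cell[1] - sensor_pos[1]
    distance = math.hypot(dx, dy)
    bearing = math.atan2(dy, dx)
    axis = math.radians(sensor_axis_deg)
    # Wrap the angular difference into [-pi, pi] before taking its magnitude.
    alpha = math.atan2(math.sin(bearing - axis), math.cos(bearing - axis))
    return distance, abs(alpha)

def path_weight(cell, s1_pos, s1_axis_deg, s2_pos, s2_axis_deg):
    """Path length S1 -> cell -> S2 and an illustrative attenuation weight."""
    d1, alpha1 = cell_geometry(cell, s1_pos, s1_axis_deg)
    d2, alpha2 = cell_geometry(cell, s2_pos, s2_axis_deg)
    # ASSUMPTION: cosine fall-off per sensor; the actual attenuation model is not given in the caption.
    weight = max(0.0, math.cos(alpha1)) * max(0.0, math.cos(alpha2))
    return d1 + d2, weight

# Usage: two rear sensors 1 m apart, both facing backwards (180 degrees, i.e. along -x in ISO 8855),
# and a cell 1 m behind the midpoint between them.
length, weight = path_weight((-1.0, 0.0), (0.0, 0.5), 180.0, (0.0, -0.5), 180.0)
```

In a full grid map, the echo amplitude sampled at the time of flight matching the path length `d1 + d2` would be scaled by such a weight and accumulated into the cell; that accumulation step is outside what the caption specifies.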