DaF-BEVSeg: Distortion-aware Fisheye Camera based Bird's Eye View Segmentation with Occlusion Reasoning

Senthil Yogamani, David Unger, Venkatraman Narayanan, Varun Ravi Kumar

TL;DR

DaF-BEVSeg tackles semantic BEV segmentation for surround-view fisheye cameras by introducing distortion-aware pooling and occlusion reasoning within a Lift-Splat-Shoot (LSS) style BEV framework. It extends BEV generation to fisheye imagery by deriving per-pixel direction vectors from radial distortion models and fusing multi-view features with learnable, camera-aware pooling, while also predicting per-cell occupancy to handle occlusions. Evaluations on a synthetic dataset built with the Cognata simulator show that skipping image undistortion avoids resampling artifacts and pre-processing runtime, and that the proposed pooling and occlusion modules improve mIoU and occlusion handling compared to rectified baselines. The approach targets near-field BEV perception for fisheye camera rigs and points to bridging the synthetic-to-real gap as a direction for future work in autonomous driving perception.
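As a rough illustration of what "deriving direction vectors from radial distortion models" can look like, the sketch below unprojects a fisheye pixel to a unit ray. It assumes a polynomial model r(θ) = k1·θ + k2·θ² + ..., where θ is the angle from the optical axis and r is the pixel distance from the principal point; this model, the coefficient values, and the function names are illustrative assumptions rather than details taken from the paper.

```python
import math

def radial_distance(theta, coeffs):
    """Assumed polynomial model: r(theta) = k1*theta + k2*theta**2 + ... (pixels)."""
    return sum(k * theta ** (i + 1) for i, k in enumerate(coeffs))

def solve_theta(r, coeffs, theta_max=0.55 * math.pi, iters=60):
    """Invert r(theta) by bisection, assuming r is monotonic on [0, theta_max]."""
    lo, hi = 0.0, theta_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if radial_distance(mid, coeffs) < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def pixel_to_ray(u, v, cx, cy, coeffs):
    """Return the unit direction (x, y, z) in the camera frame for pixel (u, v)."""
    du, dv = u - cx, v - cy
    r = math.hypot(du, dv)            # radial pixel distance from principal point
    theta = solve_theta(r, coeffs)    # angle between the ray and the optical axis
    phi = math.atan2(dv, du)          # azimuth around the optical axis
    s = math.sin(theta)
    return (s * math.cos(phi), s * math.sin(phi), math.cos(theta))

# Hypothetical intrinsics, chosen only for illustration.
coeffs = (300.0, 0.0, -10.0, 0.0)
ray = pixel_to_ray(930.0, 480.0, cx=640.0, cy=480.0, coeffs=coeffs)
```

Because the ray is computed directly from the distorted pixel, no rectified image is needed; in an LSS-style pipeline such rays would be scaled by candidate depths and transformed with the camera extrinsics before pooling into the BEV grid.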

Abstract

Semantic segmentation is an effective way to perform scene understanding. Recently, segmentation in 3D Bird's Eye View (BEV) space has become popular as it is directly used by drive policy. However, there is limited work on BEV segmentation for surround-view fisheye cameras, which are commonly used in commercial vehicles. As this task has no real-world public dataset and existing synthetic datasets do not handle amodal regions due to occlusion, we create a synthetic dataset using the Cognata simulator comprising diverse road types, weather, and lighting conditions. We generalize BEV segmentation to work with any camera model; this is useful for mixing diverse cameras. We implement a baseline by applying cylindrical rectification to the fisheye images and using a standard LSS-based BEV segmentation model. We demonstrate that we can achieve better performance without undistortion, which has the adverse effects of increased runtime due to pre-processing, reduced field-of-view, and resampling artifacts. Further, we introduce a distortion-aware learnable BEV pooling strategy that is more effective for fisheye cameras. We extend the model with an occlusion reasoning module, which is critical for estimating in BEV space. Qualitative performance of DaF-BEVSeg is showcased in the video at https://streamable.com/ge4v51.

Paper Structure (19 sections, 9 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Semantic encoding of the BEV space. Left: BEV camera image. Right: Scene's semantic representation.
  • Figure 2: Visualization of the BEV grid space covered by each camera's pixels. Left: camera BEV grid sample from the nuScenes dataset [caesar2020nuscenes], color-coded as FRONT, FRONT-RIGHT, BACK-RIGHT, BACK, BACK-LEFT, FRONT-LEFT. Right: camera BEV grid sample from our Cognata dataset, color-coded as FRONT, RIGHT, BACK, LEFT.
  • Figure 3: The network architecture converts images into image features, transforms the features into BEV space using the calibration data, and then calculates the semantic output in BEV space.
  • Figure 4: Row 1: Four different scenes from the Cognata training set. Rows 2-4: 12 different weather and traffic situations from the fifth training scene.
  • Figure 5: DaF-BEVSeg results on the Easy scene. Left: The four surround-view camera images. Top right: The ground truth. Bottom right: The model prediction.
  • ...and 2 more figures