BEV-Seg: Bird's Eye View Semantic Segmentation Using Geometry and Semantic Point Cloud
Mong H. Ng, Kaahan Radia, Jianfei Chen, Dequan Wang, Ionel Gog, Joseph E. Gonzalez
TL;DR
The paper tackles bird's-eye-view (BEV) semantic segmentation from side-view RGB images with a two-stage, depth-aware perception pipeline. Stage 1 predicts per-view depth and semantics and fuses them into a semantic point cloud, which is projected into an incomplete BEV; stage 2 parses that incomplete BEV into full BEV semantics. Because the intermediate representation abstracts away domain-specific appearance and keeps only geometry and semantics, it is shared across domains, so transferring to a new domain requires fine-tuning only the first stage. The approach improves the state of the art by 24% mIoU and transfers well to a new domain, and the authors release the BEVSEG-Carla dataset alongside it. The work advances monocular BEV understanding for autonomous driving and offers practical benefits for cross-domain deployment and planning systems.
Abstract
Bird's-eye-view (BEV) is a powerful and widely adopted representation for road scenes that captures surrounding objects and their spatial locations, along with overall context in the scene. In this work, we focus on bird's-eye-view semantic segmentation, a task that predicts pixel-wise semantic segmentation in BEV from side RGB images. This task is made possible by simulators such as Carla, which allow for cheap data collection, arbitrary camera placements, and supervision in ways otherwise not possible in the real world. There are two main challenges to this task: the view transformation from side view to bird's-eye view, and transfer learning to unseen domains. Existing work transforms between views through fully connected layers and performs transfer learning via GANs. These approaches lack depth reasoning and suffer performance degradation across domains. Our novel two-stage perception pipeline explicitly predicts pixel depths and combines them with pixel semantics in an efficient manner, allowing the model to leverage depth information to infer objects' spatial locations in the BEV. In addition, we facilitate transfer learning by abstracting high-level geometric features and predicting an intermediate representation that is common across different domains. We publish a new dataset called BEVSEG-Carla and show that our approach improves on the state of the art by 24% mIoU and performs well when transferred to a new domain.
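To make the view transformation concrete, the following is a minimal sketch of how predicted pixel depths and pixel semantics could be combined into a semantic point cloud and then projected into an incomplete BEV. It is an illustration under stated assumptions, not the authors' implementation: the pinhole camera model, the function names `semantic_point_cloud` and `rasterize_bev`, the grid parameters `extent` and `resolution`, the `unknown` fill value, and the rule of keeping the highest point per cell are all hypothetical choices made for this example.

```python
import numpy as np


def semantic_point_cloud(depth, semantics, fx, fy, cx, cy):
    """Back-project pixels into camera-frame 3D points (assumed pinhole model).

    depth:     (H, W) depth along the optical axis, in meters.
    semantics: (H, W) integer class id per pixel.
    Returns (N, 3) points [x right, y down, z forward] and (N,) labels.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column / row indices
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    labels = semantics.reshape(-1)
    valid = points[:, 2] > 0  # drop pixels without a usable depth
    return points[valid], labels[valid]


def rasterize_bev(points, labels, extent=20.0, resolution=0.5, unknown=255):
    """Orthographically project labeled points onto a ground-plane grid.

    The grid covers [-extent, extent] meters in x and z. Cells hit by no
    point keep the `unknown` value, which is why the result is incomplete.
    """
    size = int(round(2 * extent / resolution))
    bev = np.full((size, size), unknown, dtype=np.uint8)

    col = np.floor((points[:, 0] + extent) / resolution).astype(int)
    row = np.floor((extent - points[:, 2]) / resolution).astype(int)
    inside = (row >= 0) & (row < size) & (col >= 0) & (col < size)
    row, col = row[inside], col[inside]
    height, labels = points[inside, 1], labels[inside]

    # Assumption: when several points share a cell, keep the highest one
    # (smallest y, because y points down in the camera frame).
    order = np.argsort(height, kind="stable")
    flat = (row * size + col)[order]
    cells, first = np.unique(flat, return_index=True)
    bev.flat[cells] = labels[order][first]
    return bev
```

With several side cameras, the per-view point clouds would first be moved into a shared vehicle frame using each camera's extrinsics before rasterization; that step is omitted here. The cells left at `unknown` are what the second stage is meant to fill in when it parses the incomplete BEV into full BEV semantics.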
