OE-BevSeg: An Object Informed and Environment Aware Multimodal Framework for Bird's-eye-view Vehicle Semantic Segmentation

Jian Sun, Yuqi Dai, Chi-Man Vong, Qing Xu, Shengbo Eben Li, Jianqiang Wang, Lei He, Keqiang Li

TL;DR

OE-BevSeg addresses two core challenges in BEV vehicle segmentation: capturing long-range environmental context and resolving fine-grained target-object details. It combines an Environment-aware BEV Compressor with a Bi-Surround Scan and a Center-Informed Object Enhancement module, plus a multimodal fusion path that integrates RGB with radar/LiDAR features. The approach achieves state-of-the-art results on nuScenes for camera-only and multimodal BEV segmentation and demonstrates robust performance across distance, weather, and occlusion scenarios. These contributions advance reliable, scalable BEV perception for autonomous driving with practical benefits for planning and safety.

Abstract

Bird's-eye-view (BEV) semantic segmentation is becoming crucial in autonomous driving systems. It perceives the environment surrounding the ego vehicle by projecting 2D multi-view images into 3D world space. Recently, BEV segmentation has made notable progress, attributed to better view transformation modules, larger image encoders, or more temporal information. However, two issues remain: 1) BEV space features are not effectively understood or enhanced, particularly when capturing long-distance environmental features, and 2) fine details of target objects are difficult to recognize. To address these issues, we propose OE-BevSeg, an end-to-end multimodal framework that enhances BEV segmentation performance through global environment-aware perception and local target object enhancement. OE-BevSeg employs an environment-aware BEV compressor. Building on the prior knowledge that the main composition of the surrounding BEV environment varies as the distance interval increases, the compressor uses long-sequence global modeling to improve the model's understanding and perception of the environment. To enrich target object information in the segmentation results, we introduce the center-informed object enhancement module, which uses centerness information to supervise and guide the segmentation head, thereby improving segmentation through local enhancement. Additionally, we design a multimodal fusion branch that integrates multi-view RGB image features with radar/LiDAR features, achieving significant performance improvements. Extensive experiments show that, in both camera-only and multimodal fusion BEV segmentation tasks, our approach achieves state-of-the-art results by a large margin on the nuScenes dataset for vehicle segmentation, demonstrating superior applicability in the field of autonomous driving.
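
The abstract ties long-sequence modeling to the observation that the BEV environment changes with distance from the ego vehicle. The exact scan used by the paper is not specified in this summary, so the following is only a hypothetical sketch of one way a BEV grid could be flattened into distance-ordered sequences: cells are sorted ring by ring outward from the grid center, and a reversed copy provides the second direction. The function names `surround_scan_order` and `bi_surround_sequences`, the square-ring (Chebyshev) distance, and the forward/backward pairing are all assumptions made for illustration.

```python
import numpy as np


def surround_scan_order(height, width):
    """Flat indices of an H x W grid, ordered ring by ring from the center.

    Assumption: rings are square (Chebyshev distance) and cells inside a
    ring are ordered by angle around the center.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    ring = np.maximum(np.abs(ys - cy), np.abs(xs - cx))
    angle = np.arctan2(ys - cy, xs - cx)
    # np.lexsort treats the LAST key as the primary sort key.
    return np.lexsort((angle.ravel(), ring.ravel()))


def bi_surround_sequences(bev):
    """Turn a (C, H, W) BEV feature into two (H*W, C) sequences.

    Returns the near-to-far sequence, the far-to-near sequence, and the
    index order needed to scatter a sequence back onto the grid.
    """
    channels, height, width = bev.shape
    order = surround_scan_order(height, width)
    forward = bev.reshape(channels, height * width).T[order]
    backward = forward[::-1]
    return forward, backward, order
```

A sequence model would consume `forward` and `backward`; the returned `order` lets the outputs be written back to their original grid positions.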

Paper Structure (23 sections, 19 equations, 9 figures, 6 tables, 1 algorithm)

Figures (9)

  • Figure 1: The pipeline of our proposed OE-BevSeg. The BEV feature is enhanced from both the environment and the target-object perspectives to improve segmentation performance.
  • Figure 2: The pipeline of our proposed OE-BevSeg. The input consists of two parts: surround-view RGB images $I_{s}$ and radar/LiDAR point clouds $P$. The encoder extracts multi-view features $\left \{ F_{p,k} \right \} _{k=1}^{K}$ in the PV space. Through parameter-free perspective transformation, $\left \{ F_{p,k} \right \} _{k=1}^{K}$ are lifted from 2D features to 3D features $F_{3d}$. The preprocessed point cloud features $F_{p}$ are then fused with $F_{3d}$ for multimodal fusion. The fused features $F_{f}$ are processed through our designed EBC and CIOE modules. We use a multi-task head to handle the Decoder's output. Besides segmentation, our model also predicts centerness and offset for regularization.
  • Figure 3: The overall network backbone of our proposed OE-BevSeg, which can be mainly divided into two parts: the Encoder and the Decoder. The Encoder part is primarily composed of OSA blocks, while the Decoder part is mainly composed of ResNet layers. After the Encoder, a series of BEV operations are performed to complete the perspective transformation.
  • Figure 4: The architecture of our proposed Environment-aware BEV Compressor (EBC). The input is a BEV feature, which is processed by the bi-surround scan.
  • Figure 5: The architecture of our proposed Center-Informed Object Enhancement (CIOE) module. We use vehicle instance centerness in both PV space and BEV space to enrich the detailed target-object information in the segmentation results.
  • ...and 4 more figures
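
The captions for Figures 2 and 5 mention centerness and offset outputs alongside segmentation. As a rough illustration of what such targets can look like, the sketch below builds them from a BEV instance-ID map: centerness is a Gaussian heatmap peaking at each instance center, and offset is the per-cell vector pointing to that center. This follows a common convention in BEV instance prediction and is an assumption, not the paper's confirmed formulation; the function name `centerness_and_offset`, the `sigma` parameter, and the use of 0 as the background ID are hypothetical.

```python
import numpy as np


def centerness_and_offset(instance_map, sigma=3.0):
    """Build centerness and offset targets from an (H, W) instance-ID map.

    Assumptions: ID 0 is background, the instance center is the mean of
    its cell coordinates, and centerness is an unnormalized Gaussian.
    """
    height, width = instance_map.shape
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    centerness = np.zeros((height, width), dtype=np.float32)
    offset = np.zeros((2, height, width), dtype=np.float32)

    for inst_id in np.unique(instance_map):
        if inst_id == 0:
            continue  # skip background
        mask = instance_map == inst_id
        cy, cx = ys[mask].mean(), xs[mask].mean()
        # Gaussian bump centered on the instance; keep the strongest response.
        heat = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
        centerness = np.maximum(centerness, heat)
        # Offset points from each instance cell to the instance center.
        offset[0][mask] = cy - ys[mask]
        offset[1][mask] = cx - xs[mask]

    return centerness, offset
```

In a multi-task setup, such maps would typically be regressed by auxiliary heads next to the segmentation head, so the network is pushed to localize each vehicle as well as label its cells.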