Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles

Rui Song, Chenwei Liang, Hu Cao, Zhiran Yan, Walter Zimmer, Markus Gross, Andreas Festag, Alois Knoll

TL;DR

This work introduces CoHFF, a collaborative framework for 3D semantic occupancy prediction in connected automated vehicles that fuses intra- and inter-vehicle features via orthogonal-plane communications to deliver dense, semantic voxel maps. It combines two pre-trained task networks (occupancy prediction and semantic segmentation) with a hybrid fusion scheme that uses V2X plane features and deformable self-attention to produce accurate voxel-level semantics under a communication budget. The authors extend the OPV2V dataset with 12 semantic occupancy labels and demonstrate that collaboration improves semantic occupancy by over 30% and can match or exceed state-of-the-art downstream BEV/detection performance, while remaining robust under limited bandwidth. The work highlights the potential of plane-based feature sharing and task-aware fusion for richer scene understanding in multi-agent driving scenarios, and outlines paths toward real-world validation and dataset expansion.

Abstract

Collaborative perception in automated vehicles leverages the exchange of information between agents, aiming to elevate perception results. Previous camera-based collaborative 3D perception methods typically employ 3D bounding boxes or bird's eye views as representations of the environment. However, these approaches fall short in offering a comprehensive 3D environmental prediction. To bridge this gap, we introduce the first method for collaborative 3D semantic occupancy prediction. Particularly, it improves local 3D semantic occupancy predictions by hybrid fusion of (i) semantic and occupancy task features, and (ii) compressed orthogonal attention features shared between vehicles. Additionally, due to the lack of a collaborative perception dataset designed for semantic occupancy prediction, we augment a current collaborative perception dataset to include 3D collaborative semantic occupancy labels for a more robust evaluation. The experimental findings highlight that: (i) our collaborative semantic occupancy predictions excel above the results from single vehicles by over 30%, and (ii) models anchored on semantic occupancy outpace state-of-the-art collaborative 3D detection techniques in subsequent perception applications, showcasing enhanced accuracy and enriched semantic-awareness in road environments.
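The abstract's idea of sharing compressed orthogonal-plane features rather than a dense voxel volume can be illustrated with a small sketch. This is not the authors' implementation: the mean pooling below is an assumed stand-in for the attention-based compression, and the grid size, channel count, and function names (`voxel_to_planes`, `planes_to_voxel`, `payload_ratio`) are hypothetical choices made only for illustration.

```python
import numpy as np


def voxel_to_planes(vox):
    """Collapse an (X, Y, Z, C) voxel feature grid onto three orthogonal planes.

    Mean pooling along the perpendicular axis is an assumption of this sketch,
    standing in for whatever learned compression a real system would use.
    """
    return {
        "xy": vox.mean(axis=2),  # shape (X, Y, C)
        "xz": vox.mean(axis=1),  # shape (X, Z, C)
        "yz": vox.mean(axis=0),  # shape (Y, Z, C)
    }


def planes_to_voxel(planes):
    """Broadcast the three planes back to (X, Y, Z, C) and sum them."""
    xy = planes["xy"][:, :, None, :]
    xz = planes["xz"][:, None, :, :]
    yz = planes["yz"][None, :, :, :]
    return xy + xz + yz


def payload_ratio(shape):
    """Fraction of the dense voxel payload needed to send the three planes."""
    X, Y, Z, C = shape
    return (X * Y + X * Z + Y * Z) * C / (X * Y * Z * C)


# Hypothetical grid: 100 x 100 x 8 voxels with 16 feature channels.
rng = np.random.default_rng(0)
vox = rng.standard_normal((100, 100, 8, 16)).astype(np.float32)

planes = voxel_to_planes(vox)    # what a vehicle would transmit in this sketch
recon = planes_to_voxel(planes)  # what a receiver could rebuild from the planes
ratio = payload_ratio(vox.shape)
```

For this hypothetical grid the three planes carry 14.5% of the values in the dense volume, which shows why a plane-based representation suits a limited communication budget. The reconstruction is lossy, so a real system would need a learned fusion step on the receiving side.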

Paper Structure (24 sections, 5 equations, 11 figures, 5 tables, 1 algorithm)

Figures (11)

  • Figure 1: Collaborative semantic occupancy prediction leverages the power of collaboration in multi-agent systems for 3D occupancy prediction and semantic segmentation. This approach enables a deeper understanding of the 3D road environment by sharing latent features among connected automated vehicles (CAVs), surpassing the ground truth captured by a multi-camera system in the ego vehicle.
  • Figure 2: The CoHFF Framework consists of four key modules: (1) Occupancy Prediction Task Net, for occupancy feature extraction; (2) Semantic Segmentation Task Net, creating semantic plane-based embeddings; (3) V2X Feature Fusion, merging CAV features via deformable self-attention; and (4) Task Feature Fusion, uniting all task features to enhance semantic occupancy prediction.
  • Figure 3: Illustration of collaborative semantic occupancy prediction from multiple perspectives, compared to the ground truth in the ego vehicle's FoV and the collaborative FoV across CAVs. This visualization emphasizes the advanced object detection capabilities in collaborative settings, particularly for objects obscured in the ego vehicle's FoV, such as the vehicle with ID 6.
  • Figure 4: Visualization of semantic point clouds from 4 semantic LiDARs on the ego vehicle and CAVs.
  • Figure 5: Visualization of semantic point clouds from 18 semantic LiDARs.
  • ...and 6 more figures