Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles
Rui Song, Chenwei Liang, Hu Cao, Zhiran Yan, Walter Zimmer, Markus Gross, Andreas Festag, Alois Knoll
TL;DR
This work introduces CoHFF, a collaborative framework for 3D semantic occupancy prediction in connected automated vehicles. It fuses intra- and inter-vehicle features, exchanged as orthogonal-plane features, to produce dense semantic voxel maps. CoHFF combines two pre-trained task networks (occupancy prediction and semantic segmentation) with a hybrid fusion scheme that uses V2X plane features and deformable self-attention to predict voxel-level semantics under a communication budget. The authors extend the OPV2V dataset with 12 semantic occupancy labels and show that collaboration improves semantic occupancy prediction by over 30% relative to single-vehicle prediction, that the model can match or exceed state-of-the-art performance on downstream BEV and detection tasks, and that it remains robust under limited bandwidth. The work highlights the potential of plane-based feature sharing and task-aware fusion for richer scene understanding in multi-agent driving scenarios, and outlines paths toward real-world validation and dataset expansion.
Abstract
Collaborative perception in automated vehicles exchanges information between agents to improve perception results. Previous camera-based collaborative 3D perception methods typically represent the environment with 3D bounding boxes or bird's eye views. However, these representations do not provide a comprehensive 3D prediction of the environment. To bridge this gap, we introduce the first method for collaborative 3D semantic occupancy prediction. Specifically, it improves local 3D semantic occupancy predictions by hybrid fusion of (i) semantic and occupancy task features, and (ii) compressed orthogonal attention features shared between vehicles. Additionally, because no collaborative perception dataset has been designed for semantic occupancy prediction, we augment an existing collaborative perception dataset with 3D collaborative semantic occupancy labels for a more robust evaluation. The experiments show that: (i) our collaborative semantic occupancy predictions outperform single-vehicle results by over 30%, and (ii) models based on semantic occupancy outperform state-of-the-art collaborative 3D detection techniques in downstream perception applications, with higher accuracy and richer semantic awareness of road environments.
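To illustrate why sharing orthogonal-plane features can reduce communication volume compared with sharing a full voxel feature volume, the following is a minimal sketch, not the authors' implementation. It assumes plain mean pooling along each axis in place of the attention-based compression described above; the function names, the feature shape, and the averaging-based reconstruction are all hypothetical choices made for illustration.

```python
import numpy as np

def to_orthogonal_planes(vol):
    """Collapse a (C, X, Y, Z) feature volume into three axis-aligned planes.

    Assumption: mean pooling stands in for the learned, attention-based
    compression; it is used here only to show the shapes involved.
    """
    xy = vol.mean(axis=3)  # (C, X, Y)
    xz = vol.mean(axis=2)  # (C, X, Z)
    yz = vol.mean(axis=1)  # (C, Y, Z)
    return xy, xz, yz

def from_orthogonal_planes(xy, xz, yz):
    """Broadcast the three planes back to (C, X, Y, Z) and average them.

    This is a lossy, hypothetical reconstruction, not a learned fusion step.
    """
    return (xy[:, :, :, None] + xz[:, :, None, :] + yz[:, None, :, :]) / 3.0

def payload_ratio(shape):
    """Ratio of plane elements to full-volume elements for a (C, X, Y, Z) shape."""
    c, x, y, z = shape
    return (c * (x * y + x * z + y * z)) / (c * x * y * z)

# Hypothetical feature volume: 8 channels on a 100 x 100 x 8 voxel grid.
rng = np.random.default_rng(0)
volume = rng.standard_normal((8, 100, 100, 8)).astype(np.float32)

planes = to_orthogonal_planes(volume)
restored = from_orthogonal_planes(*planes)
ratio = payload_ratio(volume.shape)

print([p.shape for p in planes])
print(restored.shape, ratio)
```

For the assumed 100 x 100 x 8 grid, the three planes together hold 14.5% as many values as the full volume. The reconstruction here is lossy for general inputs, which is one reason a learned fusion step on the receiving vehicle is needed in practice.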
