A Synthetic Benchmark for Collaborative 3D Semantic Occupancy Prediction in V2X Autonomous Driving
Hanlin Wu, Pengfei Lin, Ehsan Javanmardi, Naren Bao, Bo Qian, Hao Si, Manabu Tsukada
TL;DR
The paper tackles the limits of single-vehicle perception due to occlusions and restricted range by introducing Co3SOP, a synthetic benchmark for collaborative 3D semantic occupancy prediction created in CARLA with dense voxel-level ground truth. It defines a baseline model, Co3SOP-Base, that fuses inter-agent voxel features through 3D affine alignment and voxel deformable attention with a confidence mask, enabling robust multi-agent occupancy prediction. The study shows that collaboration consistently improves semantic occupancy performance across multiple ranges, with larger gains at extended ranges, and provides extensive ablations on collaboration extent and pose noise robustness. By supplying a dense, voxel-level collaborative dataset and a strong baseline, the work enables future research to advance accurate, multi-agent 3D scene understanding for safer autonomous driving.
Abstract
3D semantic occupancy prediction is an emerging perception paradigm in autonomous driving, providing a voxel-level representation of both geometric details and semantic categories. However, the perception capability of a single vehicle is inherently constrained by occlusion, restricted sensor range, and narrow viewpoints. To address these limitations, collaborative perception enables agents to exchange complementary information, thereby improving the completeness and accuracy of the perceived scene. In the absence of a dedicated dataset for collaborative 3D semantic occupancy prediction, we augment an existing collaborative perception dataset by replaying it in CARLA with a high-resolution semantic voxel sensor to provide dense and comprehensive occupancy annotations. In addition, we establish benchmarks with varying prediction ranges designed to systematically assess the impact of spatial extent on collaborative prediction. We further develop a baseline model that performs inter-agent feature fusion via spatial alignment and attention aggregation. Experimental results demonstrate that our baseline model consistently outperforms single-agent models, with increasing gains observed as the prediction range expands.
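
To make the idea of spatial alignment followed by aggregation concrete, the following is a minimal NumPy sketch, not the Co3SOP-Base implementation. It assumes both agents use local voxel grids of identical shape, voxel size, and origin, resamples a collaborator's grid into the ego frame with a nearest-neighbour inverse warp, and then fuses with a masked mean as a simple stand-in for the attention aggregation described above. The function names `align_voxels_to_ego` and `fuse_masked_mean` and all parameters are hypothetical.

```python
import numpy as np

def align_voxels_to_ego(feat, T_ego_from_agent, voxel_size, origin):
    """Resample a collaborator's voxel features into the ego grid.

    feat: (X, Y, Z, C) features in the collaborator's local frame.
    T_ego_from_agent: 4x4 homogeneous transform, collaborator -> ego.
    voxel_size: voxel edge length in metres (assumed shared by agents).
    origin: (3,) metric position of the grid's minimum corner
            (assumed identical in each agent's local frame).
    Returns (aligned, mask); mask marks ego voxels the collaborator covers.
    """
    X, Y, Z, C = feat.shape
    origin = np.asarray(origin, dtype=float)
    # Metric centres of every ego voxel.
    idx = np.stack(
        np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z), indexing="ij"),
        axis=-1,
    ).reshape(-1, 3)
    centres_ego = origin + (idx + 0.5) * voxel_size
    # Inverse warp: express each ego centre in the collaborator's frame.
    T_agent_from_ego = np.linalg.inv(T_ego_from_agent)
    homo = np.concatenate([centres_ego, np.ones((len(centres_ego), 1))], axis=1)
    centres_agent = (homo @ T_agent_from_ego.T)[:, :3]
    # Nearest-neighbour lookup of the source voxel index.
    src = np.floor((centres_agent - origin) / voxel_size).astype(int)
    valid = np.all((src >= 0) & (src < np.array([X, Y, Z])), axis=1)
    aligned = np.zeros((X * Y * Z, C), dtype=feat.dtype)
    aligned[valid] = feat[src[valid, 0], src[valid, 1], src[valid, 2]]
    return aligned.reshape(X, Y, Z, C), valid.reshape(X, Y, Z)

def fuse_masked_mean(ego_feat, aligned, mask):
    """Average ego and collaborator features where the collaborator is valid.

    This masked mean is an assumption standing in for learned attention.
    """
    m = mask[..., None].astype(ego_feat.dtype)
    return (ego_feat + m * aligned) / (1.0 + m)
```

A learned model would replace the nearest-neighbour lookup with differentiable sampling and the masked mean with attention weights, but the data flow (warp into the ego frame, track which voxels are covered, then aggregate) stays the same.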
