A Synthetic Benchmark for Collaborative 3D Semantic Occupancy Prediction in V2X Autonomous Driving

Hanlin Wu, Pengfei Lin, Ehsan Javanmardi, Naren Bao, Bo Qian, Hao Si, Manabu Tsukada

TL;DR

Single-vehicle perception is limited by occlusion and restricted sensor range. The paper addresses this by introducing Co3SOP, a synthetic benchmark for collaborative 3D semantic occupancy prediction created in CARLA with dense voxel-level ground truth. It also defines a baseline model, Co3SOP-Base, that fuses inter-agent voxel features through 3D affine alignment and voxel deformable attention with a confidence mask. The experiments show that collaboration consistently improves semantic occupancy performance across multiple prediction ranges, with larger gains at extended ranges, and include ablations on the number of collaborating vehicles and on robustness to pose noise. The dense, voxel-level collaborative dataset and the baseline together give future work on multi-agent 3D scene understanding a common starting point.

Abstract

3D semantic occupancy prediction is an emerging perception paradigm in autonomous driving, providing a voxel-level representation of both geometric details and semantic categories. However, the perception capability of a single vehicle is inherently constrained by occlusion, restricted sensor range, and narrow viewpoints. To address these limitations, collaborative perception enables the exchange of complementary information between agents, thereby enhancing the completeness and accuracy of perception. In the absence of a dedicated dataset for collaborative 3D semantic occupancy prediction, we augment an existing collaborative perception dataset by replaying it in CARLA with a high-resolution semantic voxel sensor to provide dense and comprehensive occupancy annotations. In addition, we establish benchmarks with varying prediction ranges designed to systematically assess the impact of spatial extent on collaborative prediction. We further develop a baseline model that performs inter-agent feature fusion via spatial alignment and attention aggregation. Experimental results demonstrate that our baseline model consistently outperforms single-agent models, with increasing gains observed as the prediction range expands.

Paper Structure

This paper contains 17 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Illustration of the annotation pipeline for 3D semantic voxels using a custom sensor in CARLA, including voxel configuration, top-down broad-range box trace, object-based occupancy completion, and label assignment.
  • Figure 2: Illustration of the V2V scenarios in CARLA and the corresponding data collection results. Left: Screenshots of two V2V scenarios in CARLA based on the settings in OPV2V. Middle: LiDAR-generated 3D semantic voxel annotations. Right: The annotations collected by our 3D semantic voxel sensor.
  • Figure 3: The Collaborative 3D Semantic Occupancy Prediction Baseline (Co3SOP-Base) consists of two pipelines: (1) an ego prediction pipeline, including an image backbone, image deformable attention, and a prediction head; (2) a V2V feature fusion pipeline, including a 3D affine transformation and mask voxel deformable attention.
  • Figure 4: Ablation study on the number of collaborating vehicles used for feature fusion.
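
The V2V feature fusion pipeline described for Figure 3 first aligns a collaborator's voxel features to the ego frame and then aggregates them under a mask. The sketch below illustrates that idea only and is not the authors' implementation. It assumes both agents use the same grid shape, voxel size, and grid origin, uses nearest-neighbour resampling, and replaces the paper's voxel deformable attention with a plain masked average. The names `warp_to_ego`, `masked_mean_fuse`, `T_ego_from_nbr`, `voxel_size`, and `origin` are hypothetical.

```python
import numpy as np


def warp_to_ego(feat, T_ego_from_nbr, voxel_size, origin):
    """Resample neighbour voxel features (C, X, Y, Z) onto the ego grid.

    T_ego_from_nbr is a 4x4 homogeneous transform that maps points from the
    neighbour frame to the ego frame. Returns the warped features and a
    boolean mask (X, Y, Z) marking ego voxels covered by the neighbour grid.
    """
    C, X, Y, Z = feat.shape
    # Integer indices of every ego voxel, flattened to (N, 3).
    idx = np.stack(
        np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z), indexing="ij"),
        axis=-1,
    ).reshape(-1, 3)
    # Metric centre of each ego voxel.
    centres_ego = (idx + 0.5) * voxel_size + origin
    # Express those centres in the neighbour frame (inverse transform).
    T_nbr_from_ego = np.linalg.inv(T_ego_from_nbr)
    homo = np.concatenate([centres_ego, np.ones((len(centres_ego), 1))], axis=1)
    centres_nbr = (homo @ T_nbr_from_ego.T)[:, :3]
    # Nearest-neighbour lookup: which neighbour voxel contains each centre.
    src = np.floor((centres_nbr - origin) / voxel_size).astype(int)
    valid = np.all((src >= 0) & (src < np.array([X, Y, Z])), axis=1)
    warped = np.zeros((C, X * Y * Z), dtype=feat.dtype)
    s = src[valid]
    warped[:, valid] = feat[:, s[:, 0], s[:, 1], s[:, 2]]
    return warped.reshape(C, X, Y, Z), valid.reshape(X, Y, Z)


def masked_mean_fuse(ego_feat, warped_list, mask_list):
    """Average ego and warped neighbour features where each mask is valid.

    A simple stand-in for attention-based aggregation: voxels that no
    neighbour covers keep the ego feature unchanged.
    """
    total = ego_feat.astype(np.float64).copy()
    count = np.ones(ego_feat.shape[1:], dtype=np.float64)
    for warped, mask in zip(warped_list, mask_list):
        total += warped * mask  # mask broadcasts over the channel axis
        count += mask
    return total / count
```

With an identity transform the warp returns the input unchanged and the mask is fully valid. With a translation of one voxel along x, the first ego slice has no neighbour coverage and falls back to the ego feature during fusion.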