Vision-Only Gaussian Splatting for Collaborative Semantic Occupancy Prediction

Cheng Chen, Hao Huang, Saurabh Bagchi

TL;DR

The paper addresses vision-only collaborative 3D semantic occupancy prediction (SOP) by using sparse 3D semantic Gaussian primitives as the communication medium. Each agent encodes scene geometry and semantics into Gaussian primitives; neighbor Gaussians are rigidly aligned to the ego frame, and only those inside the ego region of interest are transmitted. A neighborhood-based fusion module then refines the combined set before Gaussian-to-voxel splatting renders semantic occupancy. The approach improves on single-agent perception and a baseline collaborative method by +8.42 and +3.28 mIoU and by +5.11 and +22.41 IoU, respectively. With fewer transmitted Gaussians it still achieves a +1.9 mIoU improvement while using only 34.6% of the communication volume. These results indicate that explicit 3D Gaussians support robust, bandwidth-efficient cross-agent fusion for vision-only SOP in urban driving scenarios.
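The alignment-and-culling step summarized above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Gaussian` fields, the function names, and the axis-aligned box used as the region of interest are all assumptions made for the example. It relies only on the standard fact that a rigid transform maps a Gaussian's mean to `R m + t` and its covariance to `R Σ Rᵀ`, while opacity and semantics are left unchanged.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


@dataclass
class Gaussian:
    """Hypothetical primitive layout, assumed for this sketch."""
    mean: Vec3                    # center in the owning agent's frame
    cov: Mat3                     # 3x3 covariance (shape and orientation)
    opacity: float
    semantics: Tuple[float, ...]  # per-class scores, untouched by alignment


def mat_vec(m: Mat3, v: Vec3) -> Vec3:
    return tuple(sum(m[i][k] * v[k] for k in range(3)) for i in range(3))


def mat_mul(a: Mat3, b: Mat3) -> Mat3:
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )


def transpose(m: Mat3) -> Mat3:
    return tuple(tuple(m[j][i] for j in range(3)) for i in range(3))


def to_ego_frame(g: Gaussian, rot: Mat3, trans: Vec3) -> Gaussian:
    """Apply the neighbor-to-ego rigid transform: mean -> R m + t, cov -> R S R^T."""
    mean = tuple(p + t for p, t in zip(mat_vec(rot, g.mean), trans))
    cov = mat_mul(mat_mul(rot, g.cov), transpose(rot))
    return Gaussian(mean, cov, g.opacity, g.semantics)


def build_message(neighbor: List[Gaussian], rot: Mat3, trans: Vec3,
                  roi_min: Vec3, roi_max: Vec3) -> List[Gaussian]:
    """Align neighbor Gaussians, then keep only those centered inside the ego ROI box."""
    aligned = (to_ego_frame(g, rot, trans) for g in neighbor)
    return [
        g for g in aligned
        if all(lo <= c <= hi for c, lo, hi in zip(g.mean, roi_min, roi_max))
    ]
```

Because geometry and semantics live in the same explicit primitive, alignment needs no feature warping or resampling; only the mean and covariance change. Culling by center position is the simplest choice here, and a real system might also account for each Gaussian's extent.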

Abstract

Collaborative perception enables connected vehicles to share information, overcoming occlusions and extending the limited sensing range inherent in single-agent (non-collaborative) systems. Existing vision-only methods for 3D semantic occupancy prediction commonly rely on dense 3D voxels, which incur high communication costs, or 2D planar features, which require accurate depth estimation or additional supervision, limiting their applicability to collaborative scenarios. To address these challenges, we propose the first approach leveraging sparse 3D semantic Gaussian splatting for collaborative 3D semantic occupancy prediction. By sharing and fusing intermediate Gaussian primitives, our method provides three benefits: a neighborhood-based cross-agent fusion that removes duplicates and suppresses noisy or inconsistent Gaussians; a joint encoding of geometry and semantics in each primitive, which reduces reliance on depth supervision and allows simple rigid alignment; and sparse, object-centric messages that preserve structural information while reducing communication volume. Extensive experiments demonstrate that our approach outperforms single-agent perception and baseline collaborative methods by +8.42 and +3.28 points in mIoU, and +5.11 and +22.41 points in IoU, respectively. When further reducing the number of transmitted Gaussians, our method still achieves a +1.9 improvement in mIoU, using only 34.6% communication volume, highlighting robust performance under limited communication budgets.
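To make the first benefit concrete, the following is a hand-written heuristic that mimics what a neighborhood-based fusion is described as doing: merging duplicates, suppressing weak Gaussians, and resolving inconsistent ones. It is a stand-in for illustration only and should not be read as the paper's fusion module; the `Gaussian` fields, the single hard class label, and the `radius` and `min_opacity` thresholds are all assumptions.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Gaussian:
    """Hypothetical, simplified primitive: center, opacity, and one class label."""
    mean: Tuple[float, float, float]
    opacity: float
    label: int


def fuse_gaussians(ego: List[Gaussian], received: List[Gaussian],
                   radius: float = 0.5, min_opacity: float = 0.1) -> List[Gaussian]:
    """Heuristic neighborhood fusion of received Gaussians into the ego set."""
    # Drop weak ego Gaussians up front (treated as noise in this sketch).
    fused = [g for g in ego if g.opacity >= min_opacity]
    for r in received:
        if r.opacity < min_opacity:
            continue  # suppress weak received Gaussians as well
        # Find the closest already-kept Gaussian within the neighborhood radius.
        nearest, nearest_dist = None, radius
        for i, g in enumerate(fused):
            d = math.dist(g.mean, r.mean)
            if d <= nearest_dist:
                nearest, nearest_dist = i, d
        if nearest is None:
            fused.append(r)  # no neighbor: new information, e.g. an occluded region
            continue
        g = fused[nearest]
        if g.label == r.label:
            # Duplicate observation: merge centers with opacity weights.
            w = g.opacity + r.opacity
            mean = tuple((g.opacity * a + r.opacity * b) / w
                         for a, b in zip(g.mean, r.mean))
            fused[nearest] = Gaussian(mean, max(g.opacity, r.opacity), g.label)
        elif r.opacity > g.opacity:
            fused[nearest] = r  # inconsistent labels: keep the more confident one
    return fused
```

The point of the sketch is that explicit primitives make fusion a local, geometric operation: each received Gaussian is compared only with Gaussians near it in the shared ego frame, rather than with a dense feature map.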

Paper Structure

This paper contains 29 sections, 17 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Comparison of shared representations. BEV and tri-plane methods transmit implicit planar features as messages, losing geometric detail and complicating alignment. We instead share explicit, interpretable 3D Gaussian primitives that preserve 3D structure and enable straightforward cross-agent fusion.
  • Figure 2: Overview of the proposed pipeline. A set of randomly initialized 3D Gaussians is refined by an Image-to-Gaussian module [huang2024gaussianformer] that attends to multi-scale image features, producing single-agent Gaussians. Gaussians from neighbor agents (top and bottom) are rigidly transformed into the ego frame (middle) and culled to the ego region of interest (ROI); the blue and yellow dashed boxes mark the Gaussians that lie within the ego ROI and are packaged and transmitted to the ego. A cross-agent Gaussian fusion module aggregates these with the ego set. The fused Gaussians are then rendered to semantic occupancy via Gaussian-to-voxel splatting [huang2024gaussianformer]. For clarity, the figure shows the zero-shot variant.
  • Figure 3: Qualitative comparison of ego-only ground truth, collaborative ground truth, predicted Gaussians, predicted occupancy, and the zero-shot variant. Red boxes highlight occluded object structure captured by Gaussian primitives in the collaborative setting. The zero-shot variant can look plausible but often shows clustered redundancy and noise; black boxes mark cases where the neighborhood-based fusion suppresses redundancy and improves consistency. An opacity threshold is applied for display, so the predicted Gaussians shown are not exhaustive.
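
The final stage named in Figure 2, Gaussian-to-voxel splatting, can be illustrated with a brute-force sketch. It assumes axis-aligned Gaussians (per-axis standard deviations instead of a full covariance), an opacity-weighted sum of class probabilities at each voxel center, and a hypothetical `min_density` threshold below which a voxel is marked empty. The cited splatting operator may differ in these details and would evaluate only nearby Gaussians per voxel rather than all of them.

```python
import math
from typing import List, Sequence, Tuple

# (mean, per-axis std, opacity, class probabilities): an assumed layout for this sketch
Gaussian = Tuple[Tuple[float, float, float], Tuple[float, float, float],
                 float, Sequence[float]]

EMPTY = -1  # label used here for voxels with too little accumulated density


def splat_to_voxels(gaussians: List[Gaussian],
                    grid_min: Tuple[float, float, float],
                    voxel_size: float,
                    grid_shape: Tuple[int, int, int],
                    num_classes: int,
                    min_density: float = 0.05) -> List[List[List[int]]]:
    """Return a dense grid of class indices (or EMPTY) from a set of Gaussians."""
    nx, ny, nz = grid_shape
    occ = [[[EMPTY] * nz for _ in range(ny)] for _ in range(nx)]
    for ix in range(nx):
        for iy in range(ny):
            for iz in range(nz):
                center = tuple(grid_min[a] + (idx + 0.5) * voxel_size
                               for a, idx in enumerate((ix, iy, iz)))
                scores = [0.0] * num_classes
                for mean, std, opacity, probs in gaussians:
                    # Squared Mahalanobis distance for a diagonal covariance.
                    m2 = sum(((c - m) / s) ** 2
                             for c, m, s in zip(center, mean, std))
                    weight = opacity * math.exp(-0.5 * m2)
                    for k in range(num_classes):
                        scores[k] += weight * probs[k]
                if sum(scores) >= min_density:
                    occ[ix][iy][iz] = max(range(num_classes),
                                          key=scores.__getitem__)
    return occ
```

For example, a single Gaussian centered in the first voxel of a two-voxel grid labels that voxel with its dominant class and leaves the second voxel empty, because its contribution there falls below the threshold.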