Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

Maximilian Luz, Rohit Mohan, Thomas Nürnberg, Yakov Miron, Daniele Cattaneo, Abhinav Valada

TL;DR

LaGS combines camera-based end-to-end tracking with mask-based multi-view panoptic occupancy prediction. It addresses the key challenge of efficiently aggregating multi-view information into 3D voxel grids with a novel latent Gaussian splatting approach.

Abstract

Capturing 4D spatiotemporal surroundings is crucial for the safe and reliable operation of robots in dynamic environments. However, most existing methods address only one side of the problem: they either provide coarse geometric tracking via bounding boxes, or detailed 3D structures like voxel-based occupancy that lack explicit temporal association. In this work, we present Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking (LaGS) that advances spatiotemporal scene understanding in a holistic direction. Our approach incorporates camera-based end-to-end tracking with mask-based multi-view panoptic occupancy prediction, and addresses the key challenge of efficiently aggregating multi-view information into 3D voxel grids via a novel latent Gaussian splatting approach. Specifically, we first fuse observations into 3D Gaussians that serve as a sparse point-centric latent representation of the 3D scene, and then splat the aggregated features onto a 3D voxel grid that is decoded by a mask-based segmentation head. We evaluate LaGS on the Occ3D nuScenes and Waymo datasets, achieving state-of-the-art performance for 4D panoptic occupancy tracking. We make our code available at https://lags.cs.uni-freiburg.de/.
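The splatting step described in the abstract (aggregated features carried by 3D Gaussians are written onto a 3D voxel grid) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `splat_gaussians_to_voxels`, the dense all-pairs evaluation, and the weight-normalized averaging are assumptions chosen for clarity. A practical version would likely restrict each Gaussian to a local neighborhood of voxels.

```python
import numpy as np

def splat_gaussians_to_voxels(means, covs, feats, grid_shape, voxel_size, origin, eps=1e-8):
    """Hypothetical helper: splat per-Gaussian latent features onto a voxel grid.

    means: (N, 3) Gaussian centers, covs: (N, 3, 3) covariances,
    feats: (N, C) latent features. Returns an (X, Y, Z, C) feature volume.
    """
    X, Y, Z = grid_shape
    # Metric coordinates of all voxel centers, flattened to (M, 3).
    idx = np.stack(
        np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z), indexing="ij"), axis=-1
    )
    centers = (origin + (idx + 0.5) * voxel_size).reshape(-1, 3)

    # Squared Mahalanobis distance between every voxel center and every Gaussian.
    diff = centers[:, None, :] - means[None, :, :]          # (M, N, 3)
    precisions = np.linalg.inv(covs)                        # (N, 3, 3)
    maha = np.einsum("mni,nij,mnj->mn", diff, precisions, diff)

    # Unnormalized Gaussian weights, then a weighted average of the features
    # (assumption: the sketch normalizes by the total weight per voxel).
    weights = np.exp(-0.5 * maha)                           # (M, N)
    weighted_sum = weights @ feats                          # (M, C)
    total = weights.sum(axis=1, keepdims=True)
    volume = weighted_sum / np.maximum(total, eps)
    return volume.reshape(X, Y, Z, -1)
```

With two well-separated Gaussians, each voxel takes on the feature of the Gaussian that dominates it, which is the behavior the resulting feature volume relies on before it is passed to a segmentation head.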

Paper Structure (19 sections, 3 equations, 4 figures, 6 tables)


Figures (4)

  • Figure 1: Illustration of our latent Gaussian representation. Bottom left: panoptic voxel predictions. Center: latent Gaussian features colorized via principal component analysis, splatted to 2D, and overlaid on the voxel predictions. Top right: latent Gaussians.
  • Figure 2: Architecture overview (left to right). An image encoder produces image features $\bm{F}$ and depth $\bm{D}$. Features are lifted via depth to a 3D volume and further encoded into a 3D voxel feature pyramid ($\bm{V}_0$, $\bm{V}_2$, …). Our latent Gaussian encoder samples points from the pyramid volumes and processes them in a coarse (left) and fine (right) stream, using self-attention (SA), windowed self-attention (WSA), cross-attention (CA), spatial cross-attention (SCA), and feed-forward networks (FFN). Our novel Serialized Multi-Stream Attention (SMSA) facilitates information exchange between streams. Refined points are decoded as Gaussians ($\bm{G}_0$, $\bm{G}_2$) and splatted back to a 3D feature volume, which is then refined to the final voxel volume $\bm{V}$. Our transformer decoder then decodes this volume into semantic and instance masks using volume cross-attention (VCA) for efficient query-to-3D-volume attention. Tracking is facilitated by the tracking-by-attention paradigm. We refine track queries by spatio-temporal reasoning before passing them on to the next frame.
  • Figure 3: Qualitative results on the Occ3D-nuScenes validation split. Our approach shows clear improvements in (1) instance separation, (2) instance association, (3) missing detections, and (4) underconfident detections.
  • Figure 4: Qualitative results on the Occ3D-Waymo validation split. Our approach shows clear improvements in (1) instance association, (2) instance separation, and (3) ID switches.
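The PCA colorization mentioned in the Figure 1 caption can be sketched as follows. This is a generic recipe, not the authors' visualization code: the function name `pca_colorize` and the per-channel min-max rescaling to [0, 1] are assumptions, and the sketch requires at least three samples and three feature channels.

```python
import numpy as np

def pca_colorize(feats):
    """Hypothetical helper: map (N, C) latent features to (N, 3) RGB values in [0, 1].

    Projects the mean-centered features onto their top three principal
    directions and rescales each resulting channel independently.
    """
    centered = feats - feats.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T                         # (N, 3)

    # Min-max rescale per channel so the result can be used directly as colors.
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    return (projected - lo) / np.maximum(hi - lo, 1e-12)
```

Gaussians with similar latent features end up with similar colors, which is what makes the colorized splats readable next to the voxel predictions.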