TFusionOcc: Student's t-Distribution Based Object-Centric Multi-Sensor Fusion Framework for 3D Occupancy Prediction

Zhenxing Ming, Julie Stephany Berrio, Mao Shan, Stewart Worrall

TL;DR

TFusionOcc tackles robust 3D semantic occupancy prediction for autonomous driving by introducing an object-centric, multi-stage multi-sensor fusion framework that leverages the Student's t-distribution and T-mixture models. The method uses deformable superquadric primitives to flexibly capture geometry, a skeleton-merge scheme to fuse LiDAR and surround-view camera data, and a Transformer-based refinement to produce dense 3D occupancy via splatting. Key contributions include the MGCAFusion module, deformable T-primitives (including inverse-warp variants), and a depth-guided 3D deformable attention mechanism. The method achieves SOTA results on nuScenes and demonstrates strong robustness on nuScenes-C under various corruptions. The approach offers improved geometric detail, robustness to outliers, and practical scalability for edge deployment, with extensive ablations and efficiency analysis supporting its effectiveness.

Abstract

3D semantic occupancy prediction enables autonomous vehicles (AVs) to perceive the fine-grained geometric and semantic structure of their surroundings from onboard sensors, which is essential for safe decision-making and navigation. Recent models for 3D semantic occupancy prediction have successfully addressed the challenge of describing real-world objects with varied shapes and classes. However, the intermediate representations used by existing methods rely heavily on 3D voxel volumes or sets of 3D Gaussians, which hinders the model's ability to efficiently and effectively capture fine-grained geometric details in the 3D driving environment. This paper introduces TFusionOcc, a novel object-centric multi-sensor fusion framework for predicting 3D semantic occupancy. By leveraging multi-stage multi-sensor fusion, the Student's t-distribution, and the T-Mixture model (TMM), together with more geometrically flexible primitives, such as the deformable superquadric (superquadric with inverse warp), the proposed method achieves state-of-the-art (SOTA) performance on the nuScenes benchmark. In addition, extensive experiments were conducted on the nuScenes-C dataset to demonstrate the robustness of the proposed method under different camera and lidar corruption scenarios. The code will be available at: https://github.com/DanielMing123/TFusionOcc
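
To illustrate why a Student's t-based primitive may tolerate outliers better than a Gaussian one, the sketch below compares the two unnormalized kernels as a function of squared Mahalanobis distance. It is a minimal illustration of the standard multivariate Student's t kernel, not the paper's implementation: the function names, the diagonal-scale simplification, and the default of 3 dimensions are all assumptions made here.

```python
import math


def gaussian_kernel(d2: float) -> float:
    """Unnormalized Gaussian weight for a squared Mahalanobis distance d2."""
    return math.exp(-0.5 * d2)


def student_t_kernel(d2: float, nu: float, dim: int = 3) -> float:
    """Unnormalized multivariate Student's t weight.

    nu is the degrees of freedom; dim is the dimensionality (3 for a
    primitive in 3D space). As nu grows, this approaches the Gaussian kernel.
    """
    return (1.0 + d2 / nu) ** (-(nu + dim) / 2.0)


def sq_mahalanobis_diag(x, mean, scale):
    """Squared Mahalanobis distance for an axis-aligned (diagonal) primitive.

    Hypothetical helper: a real primitive would also carry a rotation.
    """
    return sum(((xi - mi) / si) ** 2 for xi, mi, si in zip(x, mean, scale))


if __name__ == "__main__":
    # Compare both kernels at increasing distance from the primitive center.
    for d2 in (0.0, 1.0, 4.0, 9.0, 25.0):
        print(d2, gaussian_kernel(d2), student_t_kernel(d2, nu=3.0))
```

At the center both kernels equal 1, but far from the center the t kernel decays polynomially rather than exponentially, so a distant point still contributes a small, non-negligible weight instead of being suppressed almost entirely.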

Paper Structure (42 sections, 22 equations, 14 figures, 16 tables)

This paper contains 42 sections, 22 equations, 14 figures, 16 tables.

Figures (14)

  • Figure 1: Pipeline of three approaches: Voxel-based approach (top-left), 3D-Gaussian-Primitive-based object-centric approach (top-right), and our approach (bottom-left).
  • Figure 2: Overall architecture of TFusionOcc. The pipeline comprises two modality branches and a multi-stage feature fusion branch. The camera branch extracts multi-scale visual features and predicts a pseudo 3D point cloud from surround-view images. The pseudo 3D point cloud is then projected and cylindrically partitioned, yielding camera-based, multi-scale, dense depth maps and a voxel volume defined in cylindrical coordinates. The lidar branch applies a cylindrical partition followed by a 3D encoder to extract lidar features. Meanwhile, the lidar point cloud is also projected to generate lidar-based, multi-scale, sparse depth maps. The feature fusion branch adopts a multi-stage fusion strategy to merge all outputs from the two modality branches and uses the proposed transformer to refine the T-primitive properties with the fused features.
  • Figure 3: Inner structure of DepthNet. The 1/8-scale visual features are first used to generate a 1/8-scale depth map. Bilinear interpolation is then used to generate the 1/16- and 1/32-scale depth maps. Meanwhile, the image-based pseudo point cloud is generated solely from the 1/8-scale depth map.
  • Figure 4: Skeleton Merge Module. The upper lidar branch serves as the main skeleton, providing the foundational structure of the 3D scene. The lower camera branch augments the main skeleton with more detailed local structure, compensating for the fine-grained geometry that the main skeleton lacks.
  • Figure 5: Multi-Scale Fused Dense Depth Maps Generation. The upper part shows the overall multi-modality depth map fusion from a depth-map perspective, and the lower part shows the detailed multi-modality depth map fusion from a per-pixel ray perspective.
  • ...and 9 more figures
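
The cylindrical partition mentioned in the Figure 2 caption can be sketched as follows. This is a minimal illustration of binning Cartesian points into a (radius, azimuth, height) grid; the grid resolution, range limits, and function names are hypothetical placeholders and are not taken from the paper.

```python
import math


def cylindrical_voxel_index(point, rho_max=50.0, z_min=-5.0, z_max=3.0,
                            grid=(8, 16, 4)):
    """Map a Cartesian point (x, y, z) to a cylindrical voxel index.

    Returns (i_rho, i_phi, i_z), or None if the point lies outside the
    assumed range. All range and grid values are illustrative defaults.
    """
    x, y, z = point
    rho = math.hypot(x, y)    # radial distance from the sensor axis
    phi = math.atan2(y, x)    # azimuth in [-pi, pi]
    if rho >= rho_max or not (z_min <= z < z_max):
        return None
    n_rho, n_phi, n_z = grid
    i = int(rho / rho_max * n_rho)
    # Shift azimuth to [0, 2*pi] and wrap so phi == pi falls in bin 0.
    j = int((phi + math.pi) / (2.0 * math.pi) * n_phi) % n_phi
    k = int((z - z_min) / (z_max - z_min) * n_z)
    return i, j, k


def cylindrical_partition(points, **kwargs):
    """Group points by cylindrical voxel index, dropping out-of-range points."""
    voxels = {}
    for p in points:
        idx = cylindrical_voxel_index(p, **kwargs)
        if idx is not None:
            voxels.setdefault(idx, []).append(p)
    return voxels
```

Because the azimuth bins have a fixed angular width, voxels near the sensor cover less volume than distant ones, which roughly follows the way lidar point density falls off with range. The same binning could in principle be applied to both the lidar point cloud and the camera-derived pseudo point cloud.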