OccLoff: Learning Optimized Feature Fusion for 3D Occupancy Prediction

Ji Zhang, Yiran Ding, Zixin Liu

TL;DR

OccLoff is a framework that Learns to Optimize Feature Fusion for 3D occupancy prediction. It introduces a sparse fusion encoder with entropy masks that directly fuses 3D and 2D features, improving model accuracy while reducing computational overhead.

Abstract

3D semantic occupancy prediction is crucial for finely representing the surrounding environment, which is essential for ensuring safety in autonomous driving. Existing fusion-based occupancy methods typically perform a 2D-to-3D view transformation on image features, followed by computationally intensive 3D operations to fuse these with LiDAR features, leading to high computational costs and reduced accuracy. Moreover, current research on occupancy prediction predominantly focuses on designing specific network architectures, often tailored to particular models, with limited attention given to the more fundamental aspect of semantic feature learning. This gap hinders the development of more transferable methods that could enhance the performance of various occupancy models. To address these challenges, we propose OccLoff, a framework that Learns to Optimize Feature Fusion for 3D occupancy prediction. Specifically, we introduce a sparse fusion encoder with entropy masks that directly fuses 3D and 2D features, improving model accuracy while reducing computational overhead. Additionally, we propose a transferable proxy-based loss function and an adaptive hard sample weighting algorithm, which enhance the performance of several state-of-the-art methods. Extensive evaluations on the nuScenes and SemanticKITTI benchmarks demonstrate the superiority of our framework, and ablation studies confirm the effectiveness of each proposed module.
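
The abstract names an adaptive hard sample weighting algorithm but does not define it here. As a rough illustration of the general idea only, the sketch below upweights samples whose loss is above the batch average. The function names, the `gamma` exponent, and the normalization by the mean are all assumptions made for this example and should not be read as the paper's formulation.

```python
def hard_sample_weights(losses, gamma=1.0):
    """Weight each sample by its loss relative to the batch mean.

    Hypothetical scheme: weight_i = loss_i**gamma / mean(loss**gamma).
    gamma is an assumed hyperparameter; gamma=0 gives uniform weights.
    """
    powered = [loss ** gamma for loss in losses]
    mean = sum(powered) / len(powered)
    if mean == 0.0:
        # Every sample is already perfectly fit; fall back to uniform weights.
        return [1.0] * len(losses)
    return [p / mean for p in powered]


def weighted_loss(losses, gamma=1.0):
    """Average of per-sample losses after difficulty-based reweighting."""
    weights = hard_sample_weights(losses, gamma)
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


# Toy per-sample losses: the third sample is the "hard" one.
losses = [0.2, 0.4, 1.8]
weights = hard_sample_weights(losses)
total = weighted_loss(losses)
```

With this normalization the weights always average to one, so the reweighting shifts emphasis toward hard samples without changing the overall scale of the weights.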

Paper Structure

This paper contains 12 sections, 12 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: A comparison of the impact of two different image feature processing paradigms on multi-modal fusion. The lifting paradigm requires additional modules to lift image features into 3D space before fusing them through 3D operations, which leads to high computational costs and can introduce additional noise (e.g., errors from depth estimation). In contrast, the 3D-to-2D querying approach performs feature fusion in a single step, making it more robust (see the sparse fusion encoder subsection for details).
  • Figure 2: Our OccLoff framework. The sparse fusion encoder first performs query proposal through an entropy mask, then fuses the selected LiDAR features with surrounding multi-scale image features. SCA represents Spatial Cross Attention, where the geometric-aware SCA fuses low-resolution deep features, and the semantic-aware SCA fuses high-resolution shallow features. The temporal encoder integrates multi-frame features to enhance robustness. During the training phase, each sample is weighted based on its difficulty, and the occupancy proxy loss helps obtain more distinctive occupancy features. See the method section for details.
  • Figure 3: Visualization of the performance comparison between our method and existing state-of-the-art multi-modal occupancy methods on nuScenes-Occupancy. Our method consistently outperforms other approaches. Better viewed when zoomed in.
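
The Figure 2 caption says the sparse fusion encoder performs query proposal through an entropy mask. The sketch below shows one plausible reading of that step: compute the Shannon entropy of each voxel's class-probability vector and keep only the most uncertain voxels as queries. The selection direction (high entropy is kept), the `keep_ratio` parameter, and the function names are assumptions for illustration; the paper may define the mask differently.

```python
import math


def voxel_entropy(probs):
    """Shannon entropy (in nats) of one voxel's class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_mask(voxel_probs, keep_ratio=0.5):
    """Boolean mask that keeps the highest-entropy voxels.

    keep_ratio is a hypothetical hyperparameter: the fraction of voxels
    that are passed on as queries to the fusion step.
    """
    entropies = [voxel_entropy(p) for p in voxel_probs]
    num_keep = max(1, round(keep_ratio * len(entropies)))
    # Rank voxel indices from most to least uncertain.
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:num_keep])
    return [i in keep for i in range(len(entropies))]


# Toy per-voxel class probabilities over three classes.
voxel_probs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # ambiguous across all classes
    [0.50, 0.49, 0.01],  # ambiguous between two classes
    [1.00, 0.00, 0.00],  # fully certain
]
mask = entropy_mask(voxel_probs, keep_ratio=0.5)
```

In this toy input only the two ambiguous voxels are selected, so the more expensive fusion with image features would run on half of the voxels rather than on the full grid.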