EFFOcc: Learning Efficient Occupancy Networks from Minimal Labels for Autonomous Driving

Yining Shi; Kun Jiang; Jinyu Miao; Ke Wang; Kangan Qian; Yunlong Wang; Jiusi Li; Tuopu Wen; Mengmeng Yang; Yiliang Xu; Diange Yang

TL;DR

EFFOcc addresses the heavy computation and labeling demands of 3D occupancy networks (OccNets) for autonomous driving with two components: a lightweight fusion-based OccNet and a semi-supervised, multi-stage distillation pipeline that transfers its knowledge to a vision-only OccNet. The fusion teacher uses only simple 2D operators and, with a ResNet-18 image backbone, reaches 51.49 mIoU on Occ3D-nuScenes with 21.35M parameters. The distillation uses both labeled and unlabeled data to improve the student: trained with only 40% of the labeled sequences, the vision-only student reaches 94.38% of the mIoU of the same network trained with 100% of the labels. Across three large-scale benchmarks (Occ3D-nuScenes, Occ3D-Waymo, and OpenOccupancy-nuScenes), EFFOcc reports competitive or superior accuracy with smaller models, lower training cost, and faster inference (5.6 FPS for EFFOcc-R18 versus 0.3 FPS for RadOcc-LC), which makes onboard deployment more practical. The work also points to active learning as an avenue for further label efficiency.

Abstract

3D occupancy prediction (3DOcc) is a rapidly rising and challenging perception task in the field of autonomous driving. Existing 3D occupancy networks (OccNets) are both computationally heavy and label-hungry. In terms of model complexity, OccNets are commonly composed of heavy Conv3D modules or transformers at the voxel level. Moreover, OccNets are supervised with expensive large-scale dense voxel labels. These model and data inefficiencies, caused by excessive network parameters and label annotation requirements, severely hinder the onboard deployment of OccNets. This paper proposes an EFFicient Occupancy learning framework, EFFOcc, that targets minimal network complexity and label requirements while achieving state-of-the-art accuracy. We first propose an efficient fusion-based OccNet that uses only simple 2D operators and reaches state-of-the-art accuracy on three large-scale benchmarks: Occ3D-nuScenes, Occ3D-Waymo, and OpenOccupancy-nuScenes. On the Occ3D-nuScenes benchmark, the fusion-based model with ResNet-18 as the image backbone has 21.35M parameters and achieves a mean Intersection over Union (mIoU) of 51.49. Furthermore, we propose a multi-stage occupancy-oriented distillation to efficiently transfer knowledge to a vision-only OccNet. Extensive experiments on occupancy benchmarks show state-of-the-art accuracy for both fusion-based and vision-based OccNets. To demonstrate learning with limited labels, we achieve 94.38% of the performance (mIoU = 28.38) of a 100% labeled vision OccNet (mIoU = 30.07) using the same OccNet trained with only 40% labeled sequences and distillation from the fusion-based OccNet.
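As a rough illustration of the metric and the label-efficiency figure quoted above, the following sketch computes mIoU on a toy set of voxel labels and reproduces the 94.38% ratio from the two reported mIoU values. The function name `mean_iou`, the flat-list voxel layout, and the optional `visible` mask are assumptions made for this example; benchmark implementations typically accumulate the counts over the whole evaluation set rather than per sample.

```python
def mean_iou(pred, target, num_classes, visible=None):
    """Mean IoU over the classes that occur in the prediction or the labels.

    pred, target: flat sequences of integer class ids, one entry per voxel.
    visible: optional sequence of booleans; voxels marked False are skipped
             (assumption: stands in for a camera visibility mask).
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for i, (p, t) in enumerate(zip(pred, target)):
        if visible is not None and not visible[i]:
            continue
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            # A mismatch adds one voxel to the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [inter[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six voxels.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
toy_miou = mean_iou(pred, target, num_classes=3)  # per-class IoU: 1/3, 2/3, 1/2

# Relative performance of the 40%-label student versus the 100%-label model,
# using the two mIoU values reported in the abstract.
relative = 28.38 / 30.07 * 100
print(f"toy mIoU = {toy_miou:.3f}, relative performance = {relative:.2f}%")
```

The ratio is a quotient of two mIoU scores, so it measures how much of the fully supervised accuracy is retained, not an absolute accuracy.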

Paper Structure

This paper contains 20 sections, 5 equations, 5 figures, and 6 tables.

Introduction
Related Works
Computationally-efficient Occupancy Networks
Knowledge distillation for Autonomous Perception
Methodology
Task Formulation of 3D Occupancy Prediction
Architecture
Efficient Fusion Network
Multi-stage Occupancy Distillation
Experiments
Datasets and Metrics
Implementation Details
Results of Efficient Learning with Limited Labels
Results of Proposed Fusion-based Occupancy Network
Results on Occ3D-nuScenes
...and 5 more sections

Figures (5)

Figure 1: Graphical statistics of fusion-based teacher models and vision-only OccNets trained from scratch and trained with distillation under different labeled data scales.
Figure 2: The framework of EFFOcc. The LiDAR point cloud and multi-view images pass through a fusion network that serves as the teacher model for fusion-based occupancy prediction. The student model takes multi-view images as input and distills multi-stage features from the teacher model in both the BEV and the 3D occupancy feature space.
Figure 3: Network details of the EFFOcc fusion-based OccNet framework compared to the dense fusion OccNets OpenOccupancy and RadOcc. Our lightweight design replaces the voxel features with BEV features, OCC pooling with BEV pooling, the ResNet3D backbone with the SECOND backbone, and the complex coarse-to-fine prediction head with a simple Conv2D head.
Figure 4: Runtime efficiency analysis of RadOcc-LC, EFFOcc-R18, and EFFOcc-Swin-B. The parameter counts of RadOcc-LC, EFFOcc-R18, and EFFOcc-Swin-B are 135.39M, 21.35M, and 111.48M, and their runtimes in frames per second (FPS) are 0.3, 5.6, and 1.8, respectively.
Figure 5: Visualizations of the fusion-based OccNet and of vision-based OccNets before and after distillation. The camera visibility mask is applied when rendering the occupancy results.
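
The parameter counts and frame rates listed in the Figure 4 caption can be turned into relative factors against RadOcc-LC. The short sketch below does only that arithmetic; the dictionary layout and the helper name `relative_to_baseline` are assumptions made for illustration, and the input numbers are the ones quoted in the caption.

```python
# Parameter counts (millions) and frame rates as quoted in the Figure 4 caption.
models = {
    "RadOcc-LC": {"params_m": 135.39, "fps": 0.3},
    "EFFOcc-R18": {"params_m": 21.35, "fps": 5.6},
    "EFFOcc-Swin-B": {"params_m": 111.48, "fps": 1.8},
}


def relative_to_baseline(name, baseline="RadOcc-LC"):
    """Return (parameter reduction factor, speed-up factor) versus the baseline."""
    base = models[baseline]
    model = models[name]
    param_reduction = base["params_m"] / model["params_m"]
    speedup = model["fps"] / base["fps"]
    return param_reduction, speedup


for name in ("EFFOcc-R18", "EFFOcc-Swin-B"):
    reduction, speedup = relative_to_baseline(name)
    print(f"{name}: {reduction:.2f}x fewer parameters, {speedup:.1f}x higher FPS")
```

By this arithmetic, EFFOcc-R18 has roughly 6.3 times fewer parameters than RadOcc-LC and runs at roughly 18.7 times its frame rate.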
