EFFOcc: Learning Efficient Occupancy Networks from Minimal Labels for Autonomous Driving

Yining Shi, Kun Jiang, Jinyu Miao, Ke Wang, Kangan Qian, Yunlong Wang, Jiusi Li, Tuopu Wen, Mengmeng Yang, Yiliang Xu, Diange Yang

TL;DR

EFFOcc tackles the heavy computation and labeling demands of 3D occupancy networks for autonomous driving by introducing a fusion-based OccNet and a semi-supervised distillation pipeline to transfer knowledge to a vision-only OccNet. The fusion teacher uses simple 2D operators to achieve state-of-the-art accuracy with far fewer parameters, while the multi-stage distillation leverages both labeled and unlabeled data to improve the student. On three large benchmarks, EFFOcc delivers competitive or superior performance with dramatic reductions in model size and training cost, enabling real-time occupancy prediction. This approach enhances practical deployment by achieving high accuracy with minimal labels and computation, and points to active-learning avenues for further label efficiency.

Abstract

3D occupancy prediction (3DOcc) is a rapidly rising and challenging perception task in the field of autonomous driving. Existing 3D occupancy networks (OccNets) are both computationally heavy and label-hungry. In terms of model complexity, OccNets are commonly composed of heavy Conv3D modules or transformers at the voxel level. Moreover, OccNets are supervised with expensive large-scale dense voxel labels. Model and data inefficiencies, caused by excessive network parameters and label annotation requirements, severely hinder the onboard deployment of OccNets. This paper proposes an EFFicient Occupancy learning framework, EFFOcc, that targets minimal network complexity and label requirements while achieving state-of-the-art accuracy. We first propose an efficient fusion-based OccNet that only uses simple 2D operators and improves accuracy to the state-of-the-art on three large-scale benchmarks: Occ3D-nuScenes, Occ3D-Waymo, and OpenOccupancy-nuScenes. On the Occ3D-nuScenes benchmark, the fusion-based model with ResNet-18 as the image backbone has 21.35M parameters and achieves 51.49 in terms of mean Intersection over Union (mIoU). Furthermore, we propose a multi-stage occupancy-oriented distillation to efficiently transfer knowledge to vision-only OccNet. Extensive experiments on occupancy benchmarks show state-of-the-art precision for both fusion-based and vision-based OccNets. For the demonstration of learning with limited labels, we achieve 94.38% of the performance (mIoU = 28.38) of a 100% labeled vision OccNet (mIoU = 30.07) using the same OccNet trained with only 40% labeled sequences and distillation from the fusion-based OccNet.
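The abstract reports results as mean Intersection over Union (mIoU) and as a percentage of a fully labeled baseline. The sketch below illustrates both quantities on toy data. It is a minimal per-sample version under stated assumptions: the function names `mean_iou` and `relative_performance` are hypothetical, classes absent from both prediction and ground truth are skipped, and the optional `mask` stands in for a visibility mask. The benchmarks' official evaluation code may aggregate differently (for example, by accumulating intersections and unions over the whole dataset).

```python
def mean_iou(pred, target, num_classes, mask=None):
    """Mean IoU over classes for flat lists of integer voxel labels.

    Assumption: voxels where mask is False are ignored, and classes that
    appear in neither pred nor target are left out of the average.
    """
    if mask is None:
        mask = [True] * len(pred)
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t, m in zip(pred, target, mask):
        if not m:
            continue
        if p == t:
            # A correct voxel counts once in both intersection and union.
            inter[p] += 1
            union[p] += 1
        else:
            # A wrong voxel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


def relative_performance(student_miou, reference_miou):
    """Student mIoU as a percentage of the reference mIoU."""
    return 100.0 * student_miou / reference_miou


# Toy example: four voxels, two classes.
toy_miou = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)

# Figures quoted in the abstract: 28.38 mIoU with 40% labels plus
# distillation, versus 30.07 mIoU with 100% labels.
ratio = relative_performance(28.38, 30.07)
```

With the abstract's numbers, `ratio` rounds to 94.38, matching the stated percentage.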

Paper Structure (20 sections, 5 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Graphical statistics of fusion-based teacher models and vision-only OccNets trained from scratch and trained with distillation under different labeled data scales.
  • Figure 2: The framework of EFFOcc. The LiDAR point cloud and multi-view images pass through a fusion network that predicts occupancy and serves as the teacher model. The student model takes only multi-view images as input and distills multi-stage features from the teacher in both the BEV and the 3D occupancy feature spaces.
  • Figure 3: Network details of the EFFOcc fusion-based OccNet framework compared to dense fusion OccNets (OpenOccupancy, RadOcc). Our lightweight design replaces the voxel features with BEV features, OCC pooling with BEV pooling, the ResNet3D backbone with the SECOND backbone, and the complex coarse-to-fine prediction head with a simple Conv2D head.
  • Figure 4: Runtime efficiency analysis of RadOcc-LC, EFFOcc-R18, and EFFOcc-Swin-B. Their parameter counts are 135.39M, 21.35M, and 111.48M, and their runtimes are 0.3, 5.6, and 1.8 frames per second (FPS), respectively.
  • Figure 5: Visualizations of the fusion-based OccNet and of vision-based OccNets before and after distillation. The camera visibility mask is applied when rendering the occupancy results.
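
The parameter counts and frame rates listed for Figure 4 are easier to compare as ratios against the RadOcc-LC baseline. The sketch below does only that arithmetic. The numbers come from the caption. The dictionary layout and the `compare` helper are hypothetical names introduced for illustration, and per-frame latency is assumed to be the reciprocal of FPS.

```python
# Parameter counts (millions) and runtime (FPS) as listed for Figure 4.
MODELS = {
    "RadOcc-LC": {"params_m": 135.39, "fps": 0.3},
    "EFFOcc-R18": {"params_m": 21.35, "fps": 5.6},
    "EFFOcc-Swin-B": {"params_m": 111.48, "fps": 1.8},
}


def compare(name, baseline="RadOcc-LC"):
    """Return size and speed ratios of `name` relative to `baseline`."""
    model, base = MODELS[name], MODELS[baseline]
    return {
        # How many times fewer parameters than the baseline.
        "param_reduction": base["params_m"] / model["params_m"],
        # How many times more frames per second than the baseline.
        "speedup": model["fps"] / base["fps"],
        # Assumed latency per frame in milliseconds (1000 / FPS).
        "ms_per_frame": 1000.0 / model["fps"],
    }


if __name__ == "__main__":
    for name in ("EFFOcc-R18", "EFFOcc-Swin-B"):
        stats = compare(name)
        print(name, {key: round(value, 2) for key, value in stats.items()})
```

Under these figures, EFFOcc-R18 has roughly 6.3 times fewer parameters than RadOcc-LC and runs at about 18.7 times its frame rate, while EFFOcc-Swin-B is about 1.2 times smaller and 6 times faster.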