OccProphet: Pushing Efficiency Frontier of Camera-Only 4D Occupancy Forecasting with Observer-Forecaster-Refiner Framework

Junliang Chen, Huaiyuan Xu, Yi Wang, Lap-Pui Chau

TL;DR

The paper addresses the need for efficient 3D occupancy forecasting using cameras, tackling the high computational demands of prior approaches. It introduces OccProphet, a lightweight Observer-Forecaster-Refiner framework that performs 4D feature aggregation and tripling-attention fusion to capture rich 3D spatio-temporal context. Across nuScenes, Lyft-Level5, and nuScenes-Occupancy, OccProphet delivers 58–78% lower compute, 2.6× faster inference, and 4–18% relative gains in forecasting accuracy compared to Cam4DOcc and other baselines. The work demonstrates strong potential for edge deployment and advances the state of camera-only occupancy forecasting.

Abstract

Predicting variations in complex traffic environments is crucial for the safety of autonomous driving. Recent advancements in occupancy forecasting have enabled forecasting future 3D occupied status in driving environments by observing historical 2D images. However, high computational demands make occupancy forecasting less efficient during training and inference stages, hindering its feasibility for deployment on edge agents. In this paper, we propose a novel framework, i.e., OccProphet, to efficiently and effectively learn occupancy forecasting with significantly lower computational requirements while improving forecasting accuracy. OccProphet comprises three lightweight components: Observer, Forecaster, and Refiner. The Observer extracts spatio-temporal features from 3D multi-frame voxels using the proposed Efficient 4D Aggregation with Tripling-Attention Fusion, while the Forecaster and Refiner conditionally predict and refine future occupancy inferences. Experimental results on nuScenes, Lyft-Level5, and nuScenes-Occupancy datasets demonstrate that OccProphet is both training- and inference-friendly. OccProphet reduces 58%–78% of the computational cost with a 2.6× speedup compared with the state-of-the-art Cam4DOcc. Moreover, it achieves 4%–18% relatively higher forecasting accuracy. Code and models are publicly available at https://github.com/JLChen-C/OccProphet.
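
The abstract reports three different kinds of ratio: a relative reduction in compute, a speedup factor, and a relative accuracy gain. The Python sketch below shows how such ratios are conventionally computed. The function names are hypothetical, the paper's exact measurement protocol is not stated in this section, and the numbers in the usage lines are placeholders chosen only to exercise the formulas, not results from the paper.

```python
def relative_reduction(baseline_cost: float, new_cost: float) -> float:
    """Fraction of the baseline cost that is saved (0.6 means 60% lower)."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (baseline_cost - new_cost) / baseline_cost


def speedup(baseline_latency: float, new_latency: float) -> float:
    """How many times faster the new model runs (2.0 means twice as fast)."""
    if new_latency <= 0:
        raise ValueError("new_latency must be positive")
    return baseline_latency / new_latency


def relative_gain(baseline_score: float, new_score: float) -> float:
    """Improvement as a fraction of the baseline score, not in absolute points."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return (new_score - baseline_score) / baseline_score


# Placeholder inputs only; these are NOT measurements from the paper.
compute_saving = relative_reduction(baseline_cost=100.0, new_cost=40.0)
latency_ratio = speedup(baseline_latency=200.0, new_latency=100.0)
accuracy_gain = relative_gain(baseline_score=20.0, new_score=22.0)
```

With these definitions, a relative gain is measured against the baseline score, so a move from 20 to 22 is a 10% relative gain but only 2 absolute points. A speedup factor also differs from a reduction: a 2.6× speedup corresponds to roughly 62% lower latency, since 1 − 1/2.6 ≈ 0.615.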

Paper Structure

This paper contains 36 sections, 5 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Illustration of OccProphet. OccProphet only receives multi-camera video input and produces future occupancies.
  • Figure 2: Comparison of performance between Cam4DOcc and OccProphet.
  • Figure 3: Overview of OccProphet. It receives multi-frame images from surround-view cameras as input and outputs future occupancy or occupancy flow. It consists of four key components: the Observer, Forecaster, Refiner, and Predictor. The Observer module aggregates spatio-temporal information. The Forecaster module conditionally generates preliminary representations of future scenarios. These preliminary representations are refined by the Refiner module. Finally, the Predictor module produces the final predictions of future occupancy or occupancy flow.
  • Figure 4: Efficient 4D Aggregation (E4A).
  • Figure 5: Tripling-Attention Fusion (left) and Tripling (right).
  • ...and 8 more figures
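
The data flow described in the Figure 3 caption (Observer, then Forecaster, then Refiner, then Predictor) can be sketched as a simple composition of stages. This is a minimal illustration under stated assumptions, not the authors' implementation: every stage below is a toy stand-in, the class and function names are hypothetical, a "frame" is a flat list of per-voxel values rather than a 3D feature tensor, the image-to-voxel step is omitted, and conditioning the Forecaster on the Observer's output is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Voxels = List[float]  # toy stand-in for one frame of 3D voxel features


@dataclass
class ForecastingPipeline:
    """Hypothetical container that wires the four stages together."""
    observer: Callable[[Sequence[Voxels]], Voxels]
    forecaster: Callable[[Voxels, int], List[Voxels]]
    refiner: Callable[[List[Voxels]], List[Voxels]]
    predictor: Callable[[Voxels], List[int]]

    def forecast(self, history: Sequence[Voxels], num_future: int) -> List[List[int]]:
        context = self.observer(history)                    # aggregate past frames
        preliminary = self.forecaster(context, num_future)  # rough future features
        refined = self.refiner(preliminary)                 # clean up the rough features
        return [self.predictor(frame) for frame in refined] # per-voxel occupancy labels


# Toy stages (assumptions for illustration only).
def mean_over_time(history: Sequence[Voxels]) -> Voxels:
    # Average each voxel across all historical frames.
    return [sum(values) / len(values) for values in zip(*history)]


def repeat_context(context: Voxels, num_future: int) -> List[Voxels]:
    # Use the aggregated context as the guess for every future step.
    return [list(context) for _ in range(num_future)]


def clamp_unit(frames: List[Voxels]) -> List[Voxels]:
    # Keep every value inside [0, 1].
    return [[min(1.0, max(0.0, v)) for v in frame] for frame in frames]


def threshold(frame: Voxels) -> List[int]:
    # Mark a voxel as occupied when its value exceeds 0.5.
    return [1 if v > 0.5 else 0 for v in frame]


pipeline = ForecastingPipeline(mean_over_time, repeat_context, clamp_unit, threshold)
history = [[0.0, 1.0, 1.0], [1.0, 1.0, 0.0]]  # two past frames, three voxels each
future = pipeline.forecast(history, num_future=2)
```

Keeping each stage behind a plain callable makes the ordering explicit and lets any single stage be swapped out. In the paper these stages are learned network modules, so the toy functions here only mirror the order of operations.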